--- license: apache-2.0 language: - en - zh base_model: - openbmb/MiniCPM-V-4.6 - huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated pipeline_tag: image-text-to-text tags: - safetensors - minicpmv4_6 - multimodal - vision - abliterated - uncensored - qwen3.5 - minicpm - moe - vision-language - image-text-to-text - conversational --- # MiniCPM-V-4.6-35B-Abliterated A multimodal vision-language model combining: - **Vision:** [openbmb/MiniCPM-V-4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6) vision tower (SigLIP 400M, 27 encoder layers + ViT merger) - **Language:** [huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated) (Qwen3.5-35B-A3B with abliteration for uncensored text generation) - **Merger:** Trained MLP bridge (4608→2048) connecting vision to language ## Architecture | Component | Source | Parameters | Status | |-----------|--------|------------|--------| | Vision Tower | openbmb/MiniCPM-V-4.6 | 522M | Frozen (original weights) | | ViT Merger | openbmb/MiniCPM-V-4.6 | ~25M | Frozen (original weights) | | Merger MLP | Trained | 30.7M | **Trained** (proxy MSE loss) | | Language Model | huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated | ~35B (3B active MoE) | Abliterated weights | The merger is a single `DownsampleMLP` layer: - Input: 4608-dim (2×2 spatial merge of 1152-dim vision patches) - `LayerNorm(4608)` → `Linear(4608→4608)` → `GELU` → `Linear(4608→2048)` - Output: 2048-dim (LLM embedding space) ## Merger Training Details The merger was trained using a **proxy MSE loss** approach: - **Dataset:** LLaVA-Pretrain (558K image-caption pairs from BLIP/LAION/CC/SBU) - **Method:** `MSE(mean(merger(vision_tower(image))), mean(embed_tokens(caption)))` - **Only merger weights trained** — vision tower and LLM frozen - **Standalone training** — loaded only vision tower + merger + embed_tokens (~2.4GB GPU) ### Training Metrics | Metric | Start | End | |--------|-------|-----| | MSE Loss | 0.548 | 0.0006 | | Cosine Similarity | 0.05 | 0.10-0.12 | ### Hyperparameters - Learning rate: 1e-4 with 500-step warmup + cosine decay - Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.01) - Steps: 20,000 - Batch size: 1 - Gradient clipping: max_norm=1.0 - Hardware: NVIDIA GB10 (128GB unified memory) - Training time: ~55 minutes ## Usage ```python from transformers import AutoModelForCausalLM, AutoProcessor from PIL import Image model = AutoModelForCausalLM.from_pretrained( "jduartedj/MiniCPM-V-4.6-35B-Abliterated", trust_remote_code=True, torch_dtype="auto", device_map="auto", ) processor = AutoProcessor.from_pretrained( "jduartedj/MiniCPM-V-4.6-35B-Abliterated", trust_remote_code=True, ) image = Image.open("your_image.jpg").convert("RGB") messages = [ {"role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "Describe this image in detail."}, ]}, ] text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device) output = model.generate(**inputs, max_new_tokens=512) print(processor.decode(output[0], skip_special_tokens=True)) ``` ## Requirements - `transformers >= 5.7.0` (native `minicpmv4_6` support) - `torch >= 2.1.0` - `torchvision` - ~67GB disk space for weights - ~75GB+ GPU memory for inference (or use quantization) ## Limitations - The merger was trained with proxy MSE loss (image embedding ↔ caption embedding), not end-to-end. Vision-language alignment may not be as strong as fully fine-tuned models. - The abliterated LLM may produce unfiltered content — use responsibly. - Cosine similarity between vision and text embeddings reaches ~0.10-0.12, indicating meaningful but not perfect alignment. ## Credits - **[openbmb](https://huggingface.co/openbmb)** — MiniCPM-V-4.6 vision architecture and weights - **[huihui-ai](https://huggingface.co/huihui-ai)** — Abliterated Qwen3.5-35B-A3B language model - **Assembly & merger training** by [jduartedj](https://huggingface.co/jduartedj) ## License Apache 2.0