---
license: apache-2.0
language:
- en
- zh
base_model:
- openbmb/MiniCPM-V-4.6
- huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated
pipeline_tag: image-text-to-text
tags:
- safetensors
- minicpmv4_6
- multimodal
- vision
- abliterated
- uncensored
- qwen3.5
- minicpm
- moe
- vision-language
- image-text-to-text
- conversational
---

# MiniCPM-V-4.6-35B-Abliterated

A multimodal vision-language model combining:
- **Vision:** [openbmb/MiniCPM-V-4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6) vision tower (SigLIP 400M, 27 encoder layers + ViT merger)
- **Language:** [huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated) (Qwen3.5-35B-A3B with abliteration for uncensored text generation)
- **Merger:** Trained MLP bridge (4608→2048) connecting vision to language

## Architecture

| Component | Source | Parameters | Status |
|-----------|--------|------------|--------|
| Vision Tower | openbmb/MiniCPM-V-4.6 | 522M | Frozen (original weights) |
| ViT Merger | openbmb/MiniCPM-V-4.6 | ~25M | Frozen (original weights) |
| Merger MLP | Trained | 30.7M | **Trained** (proxy MSE loss) |
| Language Model | huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated | ~35B (3B active MoE) | Abliterated weights |

The merger is a single `DownsampleMLP` layer:
- Input: 4608-dim (2×2 spatial merge of 1152-dim vision patches)
- `LayerNorm(4608)` → `Linear(4608→4608)` → `GELU` → `Linear(4608→2048)`
- Output: 2048-dim (LLM embedding space)

## Merger Training Details

The merger was trained using a **proxy MSE loss** approach:
- **Dataset:** LLaVA-Pretrain (558K image-caption pairs from BLIP/LAION/CC/SBU)
- **Method:** `MSE(mean(merger(vision_tower(image))), mean(embed_tokens(caption)))`
- **Only merger weights trained** — vision tower and LLM frozen
- **Standalone training** — loaded only vision tower + merger + embed_tokens (~2.4GB GPU)

### Training Metrics
| Metric | Start | End |
|--------|-------|-----|
| MSE Loss | 0.548 | 0.0006 |
| Cosine Similarity | 0.05 | 0.10-0.12 |

### Hyperparameters
- Learning rate: 1e-4 with 500-step warmup + cosine decay
- Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.01)
- Steps: 20,000
- Batch size: 1
- Gradient clipping: max_norm=1.0
- Hardware: NVIDIA GB10 (128GB unified memory)
- Training time: ~55 minutes

## Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
    trust_remote_code=True,
)

image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Requirements

- `transformers >= 5.7.0` (native `minicpmv4_6` support)
- `torch >= 2.1.0`
- `torchvision`
- ~67GB disk space for weights
- ~75GB+ GPU memory for inference (or use quantization)

## Limitations

- The merger was trained with proxy MSE loss (image embedding ↔ caption embedding), not end-to-end. Vision-language alignment may not be as strong as fully fine-tuned models.
- The abliterated LLM may produce unfiltered content — use responsibly.
- Cosine similarity between vision and text embeddings reaches ~0.10-0.12, indicating meaningful but not perfect alignment.

## Credits

- **[openbmb](https://huggingface.co/openbmb)** — MiniCPM-V-4.6 vision architecture and weights
- **[huihui-ai](https://huggingface.co/huihui-ai)** — Abliterated Qwen3.5-35B-A3B language model
- **Assembly & merger training** by [jduartedj](https://huggingface.co/jduartedj)

## License

Apache 2.0