jduartedj committed on
Commit 500cc5d · verified · 1 Parent(s): 8f5238b

Update config, tokenizer, README

Files changed (2)
  1. README.md +84 -73
  2. model.safetensors.index.json +0 -0
README.md CHANGED
@@ -6,7 +6,10 @@ language:
  base_model:
  - openbmb/MiniCPM-V-4.6
  - huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated
  tags:
  - multimodal
  - vision
  - abliterated
@@ -15,97 +18,105 @@ tags:
  - minicpm
  - moe
  - vision-language
- pipeline_tag: image-text-to-text
  ---

- # MiniCPM-V 4.6 35B-A3B Abliterated (MoE)
-
- A vision-language model built by swapping [MiniCPM-V 4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6)'s original Qwen3.5-0.8B backbone with [Qwen3.5-35B-A3B Abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated), a Mixture-of-Experts model with refusal behavior removed.
-
- ## ⚠️ Experimental
-
- This is an experimental backbone swap. The vision-language merger (vit_merger) MLP was **resized** from 1024 → 2048 output dimensions using Xavier initialization but was **not fine-tuned**. As a result:
-
- - **Text-only tasks work well** with the abliterated MoE backbone
- - **Vision tasks are degraded**: the merger cannot properly project visual features to the new LLM hidden dimension without retraining
- - Fine-tuning the merger on image-text pairs is needed to restore vision capabilities
-
- ## Specs
-
- | Component | Details |
- |-----------|---------|
- | **Architecture** | MiniCPMV4_6ForConditionalGeneration |
- | **LLM Backbone** | Qwen3.5-35B-A3B Abliterated (MoE) |
- | **Total Parameters** | ~35B (3B active per token) |
- | **Hidden Size** | 2048 |
- | **LLM Layers** | 40 |
- | **Experts** | 256 total, 8 active per token |
- | **Attention** | 16 heads (2 KV heads), hybrid linear/full |
- | **Context Length** | 262,144 tokens |
- | **Vision Encoder** | SigLip2-400M (27 layers, hidden=1152) |
- | **Vocab Size** | 248,320 |
- | **Total Size** | ~69 GB (BF16) |
- | **Precision** | BF16 |
- | **Min VRAM** | ~80 GB |
-
- ## What Changed
-
- | Component | Original MiniCPM-V 4.6 | This Model |
- |-----------|------------------------|------------|
- | **LLM Backbone** | Qwen3.5-0.8B (dense) | Qwen3.5-35B-A3B **Abliterated** (MoE) |
- | **Hidden Size** | 1024 | 2048 |
- | **Merger MLP** | 1024-dim output | **Resized to 2048** (Xavier init, NOT trained) |
- | **Vision Encoder** | SigLip2-400M | SigLip2-400M (unchanged) |
- | **Refusal Behavior** | Standard guardrails | Removed via abliteration |
-
- ## Architecture Details
-
- The model uses Qwen3.5's hybrid attention pattern with alternating linear and full attention layers (3:1 ratio). The MoE architecture routes each token to 8 out of 256 experts, with shared expert layers for stability.
-
- - **Linear attention layers**: Use conv1d kernels for efficient sequence processing
- - **Full attention layers**: Standard multi-head attention every 4th layer
- - **MoE routing**: Top-8 gating per token from 256 expert MLPs (intermediate_size=512 each)
- - **Shared expert**: Always-active expert (intermediate_size=512) at each MoE layer
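
As a rough illustration of the 3:1 hybrid pattern in the removed text above, the sketch below labels 40 layers so that every 4th one uses full attention. The function name `attention_layout`, and the assumption that the full-attention layer is the last in each group of four, are illustrative and not taken from the model's config.

```python
def attention_layout(num_layers: int = 40, full_every: int = 4) -> list[str]:
    """Label each layer 'linear' or 'full', with full attention on every
    `full_every`-th layer (position within the group is an assumption)."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(num_layers)
    ]


layout = attention_layout()
num_linear = layout.count("linear")
num_full = layout.count("full")
print(num_linear, num_full)  # linear vs. full layer counts
```

With 40 layers this gives three linear-attention layers for each full-attention layer, which is the ratio the README states.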

  ## Usage

  ```python
- import torch
- from transformers import AutoModel, AutoTokenizer
  from PIL import Image

- model = AutoModel.from_pretrained(
      "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
      trust_remote_code=True,
-     torch_dtype=torch.bfloat16
  )
- model = model.eval().cuda()
- tokenizer = AutoTokenizer.from_pretrained(
      "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
-     trust_remote_code=True
  )

- # Text-only (abliterated; works well)
- msgs = [{"role": "user", "content": "Explain quantum computing without restrictions."}]
- result = model.chat(msgs=msgs, tokenizer=tokenizer)
- print(result)
-
- # Vision (experimental; merger not fine-tuned)
- image = Image.open("example.jpg")
- msgs = [{"role": "user", "content": [image, "Describe this image."]}]
- result = model.chat(msgs=msgs, tokenizer=tokenizer)
- print(result)
  ```

  ## Limitations

- - **Merger MLP not fine-tuned**: The vit_merger was resized from 1024→2048 with Xavier initialization. Vision-language alignment is broken until fine-tuned on image-text data.
- - **Large model**: Requires ~80GB VRAM for inference. Multi-GPU or offloading needed for most setups.
- - **No benchmarks**: Not formally evaluated on any vision-language benchmark.
- - **Experimental**: For research and development only.

  ## Credits

- - [OpenBMB](https://github.com/OpenBMB) for [MiniCPM-V 4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6) architecture and codebase
- - [huihui-ai](https://huggingface.co/huihui-ai) for [Qwen3.5-35B-A3B Abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated)
- - Built by [jduartedj](https://huggingface.co/jduartedj)

  base_model:
  - openbmb/MiniCPM-V-4.6
  - huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated
+ pipeline_tag: image-text-to-text
  tags:
+ - safetensors
+ - minicpmv4_6
  - multimodal
  - vision
  - abliterated

  - minicpm
  - moe
  - vision-language
+ - image-text-to-text
+ - conversational
  ---

+ # MiniCPM-V-4.6-35B-Abliterated
+
+ A multimodal vision-language model combining:
+ - **Vision:** [openbmb/MiniCPM-V-4.6](https://huggingface.co/openbmb/MiniCPM-V-4.6) vision tower (SigLIP 400M, 27 encoder layers + ViT merger)
+ - **Language:** [huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated) (Qwen3.5-35B-A3B with abliteration for uncensored text generation)
+ - **Merger:** Trained MLP bridge (4608→2048) connecting vision to language
+
+ ## Architecture
+
+ | Component | Source | Parameters | Status |
+ |-----------|--------|------------|--------|
+ | Vision Tower | openbmb/MiniCPM-V-4.6 | 522M | Frozen (original weights) |
+ | ViT Merger | openbmb/MiniCPM-V-4.6 | ~25M | Frozen (original weights) |
+ | Merger MLP | Trained | 30.7M | **Trained** (proxy MSE loss) |
+ | Language Model | huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated | ~35B (3B active MoE) | Abliterated weights |
+
+ The merger is a single `DownsampleMLP` layer:
+ - Input: 4608-dim (2×2 spatial merge of 1152-dim vision patches)
+ - `LayerNorm(4608)` → `Linear(4608→4608)` → `GELU` → `Linear(4608→2048)`
+ - Output: 2048-dim (LLM embedding space)
+
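
The 30.7M figure for the merger MLP can be sanity-checked from the layer shapes listed above. The sketch below is plain arithmetic, not the model's code; it assumes both `Linear` layers carry a bias and that `LayerNorm` has learnable scale and shift, which the README does not state explicitly.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus bias of a fully connected layer (bias assumed present)."""
    return in_features * out_features + out_features


def layernorm_params(features: int) -> int:
    """Scale and shift vectors of a LayerNorm (affine parameters assumed)."""
    return 2 * features


vision_dim = 1152                # per-patch hidden size of the vision tower
merged_dim = vision_dim * 2 * 2  # 2x2 spatial merge of neighbouring patches
llm_dim = 2048                   # LLM embedding size

merger_params = (
    layernorm_params(merged_dim)
    + linear_params(merged_dim, merged_dim)
    + linear_params(merged_dim, llm_dim)
)
print(f"{merger_params:,} parameters (~{merger_params / 1e6:.1f}M)")
```

Under these assumptions the total comes to 30,686,720 parameters, consistent with the 30.7M reported in the table.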
+ ## Merger Training Details
+
+ The merger was trained using a **proxy MSE loss** approach:
+ - **Dataset:** LLaVA-Pretrain (558K image-caption pairs from BLIP/LAION/CC/SBU)
+ - **Method:** `MSE(mean(merger(vision_tower(image))), mean(embed_tokens(caption)))`
+ - **Only merger weights trained:** vision tower and LLM frozen
+ - **Standalone training:** loaded only vision tower + merger + embed_tokens (~2.4GB GPU)
+
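
To make the proxy objective concrete, the sketch below computes the same quantity on toy data in plain Python: mean-pool the visual tokens, mean-pool the caption token embeddings, and take the MSE between the two pooled vectors. The helper names and the 4-dimensional toy vectors are assumptions for illustration; in the actual training these would be tensors produced by `merger(vision_tower(image))` and `embed_tokens(caption)`.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity, the second metric tracked during training."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins: 3 visual tokens and 2 caption tokens in a 4-dim space.
visual_tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [2.0, 0.0, 1.0, 0.0]]
caption_tokens = [[2.0, 0.0, 1.0, 0.0], [2.0, 2.0, 1.0, 0.0]]

image_vec = mean_pool(visual_tokens)
caption_vec = mean_pool(caption_tokens)
loss = mse(image_vec, caption_vec)
similarity = cosine(image_vec, caption_vec)
print(loss, similarity)
```

Because both sides are pooled to a single vector, this loss only matches average embeddings and does not constrain individual visual tokens.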
+ ### Training Metrics
+ | Metric | Start | End |
+ |--------|-------|-----|
+ | MSE Loss | 0.548 | 0.0006 |
+ | Cosine Similarity | 0.05 | 0.10-0.12 |
+
+ ### Hyperparameters
+ - Learning rate: 1e-4 with 500-step warmup + cosine decay
+ - Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.01)
+ - Steps: 20,000
+ - Batch size: 1
+ - Gradient clipping: max_norm=1.0
+ - Hardware: NVIDIA GB10 (128GB unified memory)
+ - Training time: ~55 minutes
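
One plausible reading of the learning-rate schedule listed above is sketched below. The README gives only the peak rate, the warmup length, and the total step count; the linear warmup shape and the decay floor of zero are assumptions, and `learning_rate` is a hypothetical helper rather than the training script's function.

```python
import math


def learning_rate(step, peak_lr=1e-4, warmup_steps=500, total_steps=20_000):
    """Linear warmup to peak_lr, then cosine decay to zero.

    The warmup shape and the zero floor are assumptions, not from the README.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 10_250, 20_000):
    print(step, f"{learning_rate(step):.2e}")
```

The rate peaks at step 500 and falls to half the peak midway through the decay phase, at step 10,250.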
 
 
 
 
 

  ## Usage

  ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor
  from PIL import Image

+ model = AutoModelForCausalLM.from_pretrained(
      "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
      trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="auto",
  )
+ processor = AutoProcessor.from_pretrained(
      "jduartedj/MiniCPM-V-4.6-35B-Abliterated",
+     trust_remote_code=True,
  )

+ image = Image.open("your_image.jpg").convert("RGB")
+ messages = [
+     {"role": "user", "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Describe this image in detail."},
+     ]},
+ ]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+ output = model.generate(**inputs, max_new_tokens=512)
+ print(processor.decode(output[0], skip_special_tokens=True))
  ```

+ ## Requirements
+
+ - `transformers >= 5.7.0` (native `minicpmv4_6` support)
+ - `torch >= 2.1.0`
+ - `torchvision`
+ - ~67GB disk space for weights
+ - ~75GB+ GPU memory for inference (or use quantization)
+
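
For a rough sense of how quantization changes the weight footprint, the sketch below multiplies an approximate parameter count by the bits stored per parameter. It uses the rounded "~35B" figure from the README, so it will not reproduce the ~67GB disk size exactly, and it ignores activations, KV cache, and any per-format overhead. The helper name `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


approx_params = 35e9  # "~35B" from the README; the exact count is not stated
for name, bits in (("bf16", 16), ("int8", 8), ("int4", 4)):
    print(f"{name}: ~{weight_gigabytes(approx_params, bits):.1f} GB")
```

Treat these as lower bounds on memory: real inference also needs room for activations and the KV cache.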
  ## Limitations

+ - The merger was trained with a proxy MSE loss (matching the mean image embedding to the mean caption embedding), not end-to-end. Vision-language alignment may not be as strong as in fully fine-tuned models.
+ - The abliterated LLM may produce unfiltered content; use responsibly.
+ - Cosine similarity between vision and text embeddings reaches ~0.10-0.12 (up from 0.05 at the start of training), indicating partial rather than strong alignment.

  ## Credits

+ - **[openbmb](https://huggingface.co/openbmb)**: MiniCPM-V-4.6 vision architecture and weights
+ - **[huihui-ai](https://huggingface.co/huihui-ai)**: Abliterated Qwen3.5-35B-A3B language model
+ - **Assembly & merger training** by [jduartedj](https://huggingface.co/jduartedj)
+
+ ## License
+
+ Apache 2.0
model.safetensors.index.json CHANGED
The diff for this file is too large to render. See raw diff