YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Huihui Mistral Medium 3.5 128B Abliterated NVFP4
Experimental ModelOpt NVFP4 quantization of a Mistral3/Pixtral multimodal model.
Status
- Text generation: working
- Vision/image input: working after repairing vision-tower Q/K permutation
- vLLM serving: working with TP=2 and TP=4
- Quantization format: ModelOpt NVFP4
- Tested backend: vLLM
0.19.1rc1.dev310+g8ad6ff003
Tested Hardware
- 2x/4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- 96 GiB VRAM per GPU
- Tensor parallel size:
2,4 - CUDA Graph enabled
- Quantization:
--quantization modelopt
Recommended vLLM Command
CUDA_VISIBLE_DEVICES=4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
--served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
--trust-remote-code \
--tensor-parallel-size 2 \
--quantization modelopt \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 8 \
--disable-custom-all-reduce \
--host 127.0.0.1 \
--port 8000
The NCCL environment variables above were required on the tested Blackwell workstation to avoid TP=2 NCCL initialization hangs.
For TP=4, use four GPUs and set --tensor-parallel-size 4:
CUDA_VISIBLE_DEVICES=0,1,4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
--served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--quantization modelopt \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 8 \
--disable-custom-all-reduce \
--host 127.0.0.1 \
--port 8000
Context Length
Measured with gpu_memory_utilization=0.90, CUDA Graph enabled.
| TP | --max-model-len |
Result | Notes |
|---|---|---|---|
| 2 | 8,192 | OK | Good interactive/default serving setting |
| 2 | 131,072 | OK | Long-context serving works, max concurrency about 1.19x |
| 2 | 160,000 | Failed | vLLM reported insufficient KV cache memory |
| 4 | 262,144 | OK | Official config limit; max concurrency about 2.50x |
| 4 | 524,288 | Blocked | Above model config max_position_embeddings=262144 |
vLLM reported for TP=2:
available KV cache memory: 26.25 GiB
estimated maximum model length: 156384
vLLM reported for TP=4 at --max-model-len 262144:
model loading took 28.47 GiB per GPU
available KV cache memory: 54.92 GiB
GPU KV cache size: 654,384 tokens
maximum concurrency for 262,144 tokens per request: 2.50x
Practical recommendation:
- Use
8192for interactive WebUI and concurrent serving. - Use
131072for long-context tests. - Treat
156384as the measured theoretical ceiling for this TP=2/96GiBx2 setup. - Use TP=4 for the full
262144context length. Longer contexts require overriding the model config and were not validated.
Throughput TP=2
vLLM online serving benchmark, OpenAI chat endpoint:
- Dataset: random
- Input length: 1024 tokens
- Output length: 256 tokens
- Prompts: 16
--ignore-eos--temperature 0- Server:
--max-model-len 8192 --max-num-seqs 8
| Max concurrency | Output TPS | Peak output TPS | Total token TPS | Mean TTFT | Mean TPOT |
|---|---|---|---|---|---|
| 1 | 19.83 tok/s | 21.00 tok/s | 127.29 tok/s | 538.81 ms | 48.50 ms |
| 2 | 40.45 tok/s | 42.00 tok/s | 259.63 tok/s | 135.81 ms | 49.08 ms |
| 4 | 78.62 tok/s | 80.00 tok/s | 504.55 tok/s | 152.20 ms | 50.46 ms |
| 8 | 151.44 tok/s | 160.00 tok/s | 971.92 tok/s | 125.18 ms | 52.37 ms |
Raw benchmark files:
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c1.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c2.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c4.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c8.json
Throughput TP=4
vLLM online serving benchmark, OpenAI chat endpoint:
- Dataset: random
- Input length: 1024 tokens
- Output length: 256 tokens
- Prompts: 16
--ignore-eos--temperature 0- Server:
--max-model-len 8192 --max-num-seqs 8
| Max concurrency | Output TPS | Peak output TPS | Total token TPS | Mean TTFT | Mean TPOT |
|---|---|---|---|---|---|
| 1 | 34.16 tok/s | 38.00 tok/s | 219.23 tok/s | 506.51 ms | 27.40 ms |
| 2 | 69.67 tok/s | 72.00 tok/s | 447.13 tok/s | 97.66 ms | 28.42 ms |
| 4 | 134.91 tok/s | 140.00 tok/s | 865.82 tok/s | 105.67 ms | 29.34 ms |
| 8 | 243.33 tok/s | 248.00 tok/s | 1561.69 tok/s | 127.23 ms | 32.49 ms |
Multimodal Notes
This model includes Pixtral vision components:
model.vision_towermodel.multi_modal_projectorPixtralProcessor- image tokens:
[IMG],[IMG_BREAK],[IMG_END]
The vision tower must also have GGUF Q/K row permutation repaired. Text-only repair of model.language_model.layers.*.self_attn.{q,k}_proj is not enough. Without repairing:
model.vision_tower.transformer.layers.*.attention.{q,k}_proj
the model may identify coarse colors but fail badly on UI/OCR-like images.
After repairing the vision tower Q/K tensors, the model correctly read a KIMI Code console screenshot including:
KIMI CodeConsoleWeekly usageRate limit detailsModeratoK2.6API Keys
Known Caveats
- ModelOpt NVFP4 checkpoint support in vLLM is experimental.
--reasoning-parser mistraldid not work with this HF-converted directory because vLLM expected a MistralTokenizer-compatible tokenizer file.- Reasoning High can be passed through
chat_template_kwargs, but reasoning text may appear in normal assistant content rather than a separate API field. - TP=2 required NCCL P2P/NVLS/IB/CUMEM workarounds on the tested system.
Support the Base Model Author
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin:
bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
- Downloads last month
- 56