Huihui Mistral Medium 3.5 128B Abliterated NVFP4

Experimental ModelOpt NVFP4 quantization of a Mistral3/Pixtral multimodal model.

Status

Text generation: working
Vision/image input: working after repairing vision-tower Q/K permutation
vLLM serving: working with TP=2 and TP=4
Quantization format: ModelOpt NVFP4
Tested backend: vLLM 0.19.1rc1.dev310+g8ad6ff003

Tested Hardware

2x/4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition
96 GiB VRAM per GPU
Tensor parallel size: 2, 4
CUDA Graph enabled
Quantization: --quantization modelopt

Recommended vLLM Command

CUDA_VISIBLE_DEVICES=4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
  --served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --disable-custom-all-reduce \
  --host 127.0.0.1 \
  --port 8000

The NCCL environment variables above were required on the tested Blackwell workstation to avoid TP=2 NCCL initialization hangs.

For TP=4, use four GPUs and set --tensor-parallel-size 4:

CUDA_VISIBLE_DEVICES=0,1,4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
  --served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --quantization modelopt \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --disable-custom-all-reduce \
  --host 127.0.0.1 \
  --port 8000

Context Length

Measured with gpu_memory_utilization=0.90, CUDA Graph enabled.

TP	`--max-model-len`	Result	Notes
2	8,192	OK	Good interactive/default serving setting
2	131,072	OK	Long-context serving works, max concurrency about 1.19x
2	160,000	Failed	vLLM reported insufficient KV cache memory
4	262,144	OK	Official config limit; max concurrency about 2.50x
4	524,288	Blocked	Above model config `max_position_embeddings=262144`

vLLM reported for TP=2:

available KV cache memory: 26.25 GiB
estimated maximum model length: 156384

vLLM reported for TP=4 at --max-model-len 262144:

model loading took 28.47 GiB per GPU
available KV cache memory: 54.92 GiB
GPU KV cache size: 654,384 tokens
maximum concurrency for 262,144 tokens per request: 2.50x

Practical recommendation:

Use 8192 for interactive WebUI and concurrent serving.
Use 131072 for long-context tests.
Treat 156384 as the measured theoretical ceiling for this TP=2/96GiBx2 setup.
Use TP=4 for the full 262144 context length. Longer contexts require overriding the model config and were not validated.

Throughput TP=2

vLLM online serving benchmark, OpenAI chat endpoint:

Dataset: random
Input length: 1024 tokens
Output length: 256 tokens
Prompts: 16
--ignore-eos
--temperature 0
Server: --max-model-len 8192 --max-num-seqs 8

Max concurrency	Output TPS	Peak output TPS	Total token TPS	Mean TTFT	Mean TPOT
1	19.83 tok/s	21.00 tok/s	127.29 tok/s	538.81 ms	48.50 ms
2	40.45 tok/s	42.00 tok/s	259.63 tok/s	135.81 ms	49.08 ms
4	78.62 tok/s	80.00 tok/s	504.55 tok/s	152.20 ms	50.46 ms
8	151.44 tok/s	160.00 tok/s	971.92 tok/s	125.18 ms	52.37 ms

Raw benchmark files:

/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c1.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c2.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c4.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c8.json

Throughput TP=4

vLLM online serving benchmark, OpenAI chat endpoint:

Dataset: random
Input length: 1024 tokens
Output length: 256 tokens
Prompts: 16
--ignore-eos
--temperature 0
Server: --max-model-len 8192 --max-num-seqs 8

Max concurrency	Output TPS	Peak output TPS	Total token TPS	Mean TTFT	Mean TPOT
1	34.16 tok/s	38.00 tok/s	219.23 tok/s	506.51 ms	27.40 ms
2	69.67 tok/s	72.00 tok/s	447.13 tok/s	97.66 ms	28.42 ms
4	134.91 tok/s	140.00 tok/s	865.82 tok/s	105.67 ms	29.34 ms
8	243.33 tok/s	248.00 tok/s	1561.69 tok/s	127.23 ms	32.49 ms

Multimodal Notes

This model includes Pixtral vision components:

model.vision_tower
model.multi_modal_projector
PixtralProcessor
image tokens: [IMG], [IMG_BREAK], [IMG_END]

The vision tower must also have GGUF Q/K row permutation repaired. Text-only repair of model.language_model.layers.*.self_attn.{q,k}_proj is not enough. Without repairing:

model.vision_tower.transformer.layers.*.attention.{q,k}_proj

the model may identify coarse colors but fail badly on UI/OCR-like images.

After repairing the vision tower Q/K tensors, the model correctly read a KIMI Code console screenshot including:

KIMI Code
Console
Weekly usage
Rate limit details
Moderato
K2.6
API Keys

Known Caveats

ModelOpt NVFP4 checkpoint support in vLLM is experimental.
--reasoning-parser mistral did not work with this HF-converted directory because vLLM expected a MistralTokenizer-compatible tokenizer file.
Reasoning High can be passed through chat_template_kwargs, but reasoning text may appear in normal assistant content rather than a separate API field.
TP=2 required NCCL P2P/NVLS/IB/CUMEM workarounds on the tested system.

Support the Base Model Author

Ko-fi: https://ko-fi.com/huihuiai
Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge

Downloads last month: 56

Safetensors

Model size

87B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support