YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

Huihui Mistral Medium 3.5 128B Abliterated NVFP4

Experimental ModelOpt NVFP4 quantization of a Mistral3/Pixtral multimodal model.

Status

  • Text generation: working
  • Vision/image input: working after repairing vision-tower Q/K permutation
  • vLLM serving: working with TP=2 and TP=4
  • Quantization format: ModelOpt NVFP4
  • Tested backend: vLLM 0.19.1rc1.dev310+g8ad6ff003

Tested Hardware

  • 2x/4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  • 96 GiB VRAM per GPU
  • Tensor parallel size: 2, 4
  • CUDA Graph enabled
  • Quantization: --quantization modelopt

Recommended vLLM Command

CUDA_VISIBLE_DEVICES=4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
  --served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --disable-custom-all-reduce \
  --host 127.0.0.1 \
  --port 8000

The NCCL environment variables above were required on the tested Blackwell workstation to avoid TP=2 NCCL initialization hangs.

For TP=4, use four GPUs and set --tensor-parallel-size 4:

CUDA_VISIBLE_DEVICES=0,1,4,5 \
NCCL_P2P_DISABLE=1 \
NCCL_NVLS_ENABLE=0 \
NCCL_IB_DISABLE=1 \
NCCL_CUMEM_ENABLE=0 \
vllm serve /path/to/Huihui-Mistral-Medium-3.5-128B-abliterated-NVFP4 \
  --served-model-name huihui-mistral-medium-3.5-128b-nvfp4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --quantization modelopt \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --disable-custom-all-reduce \
  --host 127.0.0.1 \
  --port 8000

Context Length

Measured with gpu_memory_utilization=0.90, CUDA Graph enabled.

TP --max-model-len Result Notes
2 8,192 OK Good interactive/default serving setting
2 131,072 OK Long-context serving works, max concurrency about 1.19x
2 160,000 Failed vLLM reported insufficient KV cache memory
4 262,144 OK Official config limit; max concurrency about 2.50x
4 524,288 Blocked Above model config max_position_embeddings=262144

vLLM reported for TP=2:

available KV cache memory: 26.25 GiB
estimated maximum model length: 156384

vLLM reported for TP=4 at --max-model-len 262144:

model loading took 28.47 GiB per GPU
available KV cache memory: 54.92 GiB
GPU KV cache size: 654,384 tokens
maximum concurrency for 262,144 tokens per request: 2.50x

Practical recommendation:

  • Use 8192 for interactive WebUI and concurrent serving.
  • Use 131072 for long-context tests.
  • Treat 156384 as the measured theoretical ceiling for this TP=2/96GiBx2 setup.
  • Use TP=4 for the full 262144 context length. Longer contexts require overriding the model config and were not validated.

Throughput TP=2

vLLM online serving benchmark, OpenAI chat endpoint:

  • Dataset: random
  • Input length: 1024 tokens
  • Output length: 256 tokens
  • Prompts: 16
  • --ignore-eos
  • --temperature 0
  • Server: --max-model-len 8192 --max-num-seqs 8
Max concurrency Output TPS Peak output TPS Total token TPS Mean TTFT Mean TPOT
1 19.83 tok/s 21.00 tok/s 127.29 tok/s 538.81 ms 48.50 ms
2 40.45 tok/s 42.00 tok/s 259.63 tok/s 135.81 ms 49.08 ms
4 78.62 tok/s 80.00 tok/s 504.55 tok/s 152.20 ms 50.46 ms
8 151.44 tok/s 160.00 tok/s 971.92 tok/s 125.18 ms 52.37 ms

Raw benchmark files:

/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c1.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c2.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c4.json
/media/tonoken/CT4000/benchmarks/huihui_nvfp4_tp2/serve_c8.json

Throughput TP=4

vLLM online serving benchmark, OpenAI chat endpoint:

  • Dataset: random
  • Input length: 1024 tokens
  • Output length: 256 tokens
  • Prompts: 16
  • --ignore-eos
  • --temperature 0
  • Server: --max-model-len 8192 --max-num-seqs 8
Max concurrency Output TPS Peak output TPS Total token TPS Mean TTFT Mean TPOT
1 34.16 tok/s 38.00 tok/s 219.23 tok/s 506.51 ms 27.40 ms
2 69.67 tok/s 72.00 tok/s 447.13 tok/s 97.66 ms 28.42 ms
4 134.91 tok/s 140.00 tok/s 865.82 tok/s 105.67 ms 29.34 ms
8 243.33 tok/s 248.00 tok/s 1561.69 tok/s 127.23 ms 32.49 ms

Multimodal Notes

This model includes Pixtral vision components:

  • model.vision_tower
  • model.multi_modal_projector
  • PixtralProcessor
  • image tokens: [IMG], [IMG_BREAK], [IMG_END]

The vision tower must also have GGUF Q/K row permutation repaired. Text-only repair of model.language_model.layers.*.self_attn.{q,k}_proj is not enough. Without repairing:

model.vision_tower.transformer.layers.*.attention.{q,k}_proj

the model may identify coarse colors but fail badly on UI/OCR-like images.

After repairing the vision tower Q/K tensors, the model correctly read a KIMI Code console screenshot including:

  • KIMI Code
  • Console
  • Weekly usage
  • Rate limit details
  • Moderato
  • K2.6
  • API Keys

Known Caveats

  • ModelOpt NVFP4 checkpoint support in vLLM is experimental.
  • --reasoning-parser mistral did not work with this HF-converted directory because vLLM expected a MistralTokenizer-compatible tokenizer file.
  • Reasoning High can be passed through chat_template_kwargs, but reasoning text may appear in normal assistant content rather than a separate API field.
  • TP=2 required NCCL P2P/NVLS/IB/CUMEM workarounds on the tested system.

Support the Base Model Author

Downloads last month
56
Safetensors
Model size
87B params
Tensor type
BF16
F8_E4M3
U8
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support