base_model: Qwen/Qwen3.5-397B-A17B
tags:
  - quantized
  - nvfp4
  - compressed-tensors
  - llm-compressor
  - moe
  - qwen3.5
quantized_by: Sehyo

Qwen3.5-397B-A17B-NVFP4

This is a quantized version of Qwen/Qwen3.5-397B-A17B using the NVFP4 quantization scheme.

Important

Requires this vLLM PR: https://github.com/vllm-project/vllm/pull/34723. At the time of writing it is not yet included in the nightly build, so you may need to build vLLM from source. Alternatively, patch the latest nightly image yourself to include the PR.

Changelog

02/03/2026: Added MTP (multi-token prediction) weights from source checkpoint, enabling speculative decoding with vLLM.
25/02/2026: Added missing processor configs (preprocessor_config.json, video_preprocessor_config.json, processor_config.json), vocab.json, and restored full tokenizer_config.json from the base model. Fixes vision/video input support and tokenizer loading issues.
22/02/2026: Re-quantized with improved calibration data and parameters. Fixed 14 Inf input_global_scale values caused by rarely activated experts receiving all-zero activations during calibration. All 92,400 scale tensors are now valid. Fixed tokenizer_class and added mlp.gate to the quantization ignore list in config.json.
20/02/2026: Reuploaded weights with some issues fixed.
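
The Inf input_global_scale values mentioned in the 22/02/2026 entry can be illustrated with a small sketch. It assumes the global scale is derived as (FP8 E4M3 max × FP4 E2M1 max) / amax, where amax is the largest absolute activation observed during calibration. That formula and the helper names below are illustrative assumptions, not code taken from this repository.

```python
import math

FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def global_scale(amax: float) -> float:
    """Assumed NVFP4-style global scale: (FP8 max * FP4 max) / amax.

    Tensor libraries return inf for a float division by zero, while plain
    Python raises ZeroDivisionError, so that case is mimicked explicitly.
    """
    if amax == 0.0:
        return math.inf
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax


def count_invalid(scales) -> int:
    """Count scales that are Inf or NaN (hypothetical validation helper)."""
    return sum(1 for s in scales if not math.isfinite(s))


# An expert that only ever saw all-zero activations has amax == 0.
scales = [global_scale(a) for a in (3.5, 0.0, 12.0)]
print(count_invalid(scales))
```

Under this assumption, any expert whose observed activations are all zero ends up with an infinite scale, which is why calibrating every expert matters.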

Calibration

Samples: 512 (256 from each dataset)
Datasets:
- HuggingFaceH4/ultrachat_200k (train_sft split)
- nvidia/Nemotron-Post-Training-Dataset-v2 (chat split)
Max sequence length: 4096
All experts calibrated: moe_calibrate_all_experts=True
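
As a rough sketch of the mix described above, the function below draws an equal number of examples from each source and truncates them to the maximum sequence length. The function name, the seeded random sampling, and the in-memory token lists are assumptions for illustration; the actual selection procedure used for this model is not documented here.

```python
import random


def build_calibration_set(sources, per_source=256, max_seq_len=4096, seed=0):
    """Draw `per_source` examples from each source, truncated to `max_seq_len` tokens.

    sources: dict mapping a dataset name to a list of token-id lists.
    """
    rng = random.Random(seed)
    mixed = []
    for name, examples in sources.items():
        # random.sample raises ValueError if a source has too few examples.
        picked = rng.sample(examples, per_source)
        mixed.extend(tokens[:max_seq_len] for tokens in picked)
    return mixed


# Toy stand-ins for the two tokenized datasets (hypothetical data).
toy_sources = {
    "ultrachat_200k": [[1] * 5000 for _ in range(300)],
    "nemotron_v2": [[2] * 10 for _ in range(300)],
}
calibration = build_calibration_set(toy_sources)
print(len(calibration))
```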

Creation

This model was created using vLLM's LLM Compressor, with Qwen3.5 MoE support added via PR #2383. The PR adds a custom CalibrationQwen3MoeSparseMoeBlock that routes calibration data to all experts during quantization, so that every expert receives calibration data for accurate NVFP4 quantization.
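
The effect of routing calibration data to all experts can be shown with a toy example. This is a simplified sketch, not the code from the PR: the function names and the router scores are hypothetical, and it only counts how many calibration tokens each expert observes.

```python
def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def tokens_seen_per_expert(router_scores, k, calibrate_all_experts):
    """Count how many calibration tokens each expert observes.

    router_scores: one list of per-expert scores per token.
    """
    num_experts = len(router_scores[0])
    seen = [0] * num_experts
    for scores in router_scores:
        if calibrate_all_experts:
            chosen = range(num_experts)  # every expert sees the token
        else:
            chosen = route_top_k(scores, k)  # only the routed experts see it
        for e in chosen:
            seen[e] += 1
    return seen


# Toy router that always prefers experts 0 and 1; experts 2 and 3 never win.
router_scores = [[0.9, 0.8, 0.1, 0.0] for _ in range(8)]
print(tokens_seen_per_expert(router_scores, k=2, calibrate_all_experts=False))
print(tokens_seen_per_expert(router_scores, k=2, calibrate_all_experts=True))
```

With normal top-k routing, the unpopular experts in this toy setup see no tokens at all, so no activation statistics are collected for them. With all-expert calibration, every expert observes every token.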