---
base_model: Qwen/Qwen3.5-397B-A17B
tags:
  - quantized
  - nvfp4
  - compressed-tensors
  - llm-compressor
  - moe
  - qwen3.5
quantized_by: Sehyo
---

# Qwen3.5-397B-A17B-NVFP4

This is a quantized version of [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) using the **NVFP4** quantization scheme.

## Important
This model needs the following vLLM PR to work: https://github.com/vllm-project/vllm/pull/34723
At the time of writing, the PR is not yet included in the nightly build, so you might need to build vLLM from source.
Alternatively, patch the latest nightly image yourself to include that PR.

## Changelog
- **02/03/2026**: Added MTP (multi-token prediction) weights from source checkpoint, enabling speculative decoding with vLLM.
- **25/02/2026**: Added missing processor configs (`preprocessor_config.json`, `video_preprocessor_config.json`, `processor_config.json`), `vocab.json`, and restored full `tokenizer_config.json` from the base model. Fixes vision/video input support and tokenizer loading issues.
- **22/02/2026**: Re-quantized with improved calibration data and parameters. Fixed 14 Inf `input_global_scale` values caused by rarely-activated experts receiving all-zero activations during calibration. All 92,400 scale tensors now valid. Fixed `tokenizer_class` and added `mlp.gate` to quantization ignore list in `config.json`.
- **20/02/2026**: Reuploaded weights with some issues fixed.
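
The Inf `input_global_scale` issue from the 22/02/2026 entry can be illustrated with a small sketch. It assumes the global scale is a fixed constant divided by the largest absolute activation seen during calibration. The constant used here (448 × 6, the FP8 E4M3 and FP4 E2M1 maxima) and the tensor names are assumptions for illustration, not values read from this checkpoint. Under that assumption, an expert that only ever sees zeros ends up with an infinite scale, which a finiteness check can flag:

```python
import math
from typing import Dict, List

# Assumed constants: largest finite FP8 E4M3 and FP4 E2M1 magnitudes.
FP8_E4M3_MAX = 448.0
FP4_E2M1_MAX = 6.0


def global_scale_from_amax(amax: float) -> float:
    """Assumed formula: constant / amax.

    Mimics floating-point tensor division, where dividing by zero
    yields inf instead of raising an exception.
    """
    if amax == 0.0:
        return math.inf
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax


def find_invalid_scales(scales: Dict[str, float]) -> List[str]:
    """Return the names of all scales that are Inf or NaN, sorted."""
    return sorted(name for name, value in scales.items() if not math.isfinite(value))


# Hypothetical tensor names; expert 7 received only zero activations.
example_scales = {
    "experts.3.input_global_scale": global_scale_from_amax(2.0),
    "experts.7.input_global_scale": global_scale_from_amax(0.0),
}
invalid = find_invalid_scales(example_scales)
```

A check like this, run over every scale tensor after quantization, is one way to confirm that no expert was left uncalibrated.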

## Calibration

- **Samples**: 512 (256 from each dataset)
- **Datasets**:
  - [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (`train_sft` split)
  - [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (`chat` split)
- **Max sequence length**: 4096
- **All experts calibrated**: `moe_calibrate_all_experts=True`
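
The equal split described above (256 samples from each of two datasets, 512 in total) could be assembled roughly as follows. This is a minimal sketch using in-memory lists as stand-ins for the real datasets. The helper name `build_calibration_mix`, the seed, and the shuffling step are assumptions, since the card does not describe how samples were selected or ordered:

```python
import random
from typing import Dict, List, Sequence


def build_calibration_mix(
    sources: Dict[str, Sequence[str]],
    per_source: int,
    seed: int = 0,
) -> List[str]:
    """Draw `per_source` samples from each source and shuffle them together."""
    rng = random.Random(seed)
    mixed: List[str] = []
    # Iterate in sorted order so the result does not depend on dict ordering.
    for name in sorted(sources):
        pool = list(sources[name])
        if len(pool) < per_source:
            raise ValueError(f"{name} has only {len(pool)} samples, need {per_source}")
        mixed.extend(rng.sample(pool, per_source))
    rng.shuffle(mixed)
    return mixed


# Placeholder text standing in for the two calibration datasets.
stand_in_sources = {
    "ultrachat_200k": [f"ultrachat-{i}" for i in range(300)],
    "nemotron_v2": [f"nemotron-{i}" for i in range(300)],
}
calibration_samples = build_calibration_mix(stand_in_sources, per_source=256, seed=0)
```

In a real pipeline each sample would also be tokenized and truncated to the 4096-token maximum sequence length before calibration.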

## Creation

This model was created using [vLLM's LLM Compressor](https://github.com/vllm-project/llm-compressor) with Qwen3.5 MoE support added via [PR #2383](https://github.com/vllm-project/llm-compressor/pull/2383). The PR adds a custom `CalibrationQwen3MoeSparseMoeBlock` that routes calibration data to all experts during quantization, so that every expert receives calibration data for accurate NVFP4 quantization.
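
The idea of routing calibration data to all experts can be sketched with a toy MoE layer. This is not the code from the PR: the function `moe_forward`, the scalar "tokens", and the `seen` counter (a stand-in for a calibration observer) are all hypothetical. It shows one plausible approach, in which every expert processes every token during calibration while the layer output is still built only from the top-k routed experts:

```python
from typing import Callable, List, Sequence

Expert = Callable[[float], float]


def moe_forward(
    tokens: Sequence[float],
    router_scores: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int,
    calibrate_all_experts: bool,
    seen: List[int],
) -> List[float]:
    """Toy MoE forward pass; seen[i] counts tokens processed by expert i."""
    outputs: List[float] = []
    for token, scores in zip(tokens, router_scores):
        # Pick the top_k experts with the highest router scores.
        ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
        chosen = set(ranked[:top_k])
        weight_sum = sum(scores[i] for i in chosen)
        out = 0.0
        for i, expert in enumerate(experts):
            if i in chosen or calibrate_all_experts:
                y = expert(token)  # the expert sees this token's activation
                seen[i] += 1
                if i in chosen:
                    # Only routed experts contribute to the layer output.
                    out += scores[i] / weight_sum * y
        outputs.append(out)
    return outputs


experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
tokens = [1.0, 2.0]
scores = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

seen_routed = [0, 0, 0]
out_routed = moe_forward(tokens, scores, experts, 1, False, seen_routed)

seen_all = [0, 0, 0]
out_all = moe_forward(tokens, scores, experts, 1, True, seen_all)
```

With normal routing, experts 1 and 2 in this example never see a token, which is the situation that leaves rarely-activated experts without usable calibration statistics. With `calibrate_all_experts=True` each expert sees both tokens while the outputs stay unchanged.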