---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-2B
license: apache-2.0
library_name: transformers
tags:
- Surogate
- ModelOpt
- Qwen3.5
- quantized
- NVFP4
- nvfp4
- sglang
---
# Qwen3.5-2B-NVFP4
**This Qwen3.5 variant is recommended for Surogate on Blackwell NVIDIA GPUs. Check out [http://surogate.ai](http://surogate.ai)**
This is an NVFP4-quantized version of [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) (2B parameters), quantized using [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). Weights and activations of linear layers are quantized to FP4, reducing the disk size and GPU memory of those layers by roughly 3.5x compared to BF16 (4-bit values plus one FP8 scale per 16 elements, about 4.5 bits per weight versus 16).
**About NVFP4 quantization:** NVFP4 on Blackwell couples a compact E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks, so that 4-bit stored values remain numerically useful for neural-network computation. The E2M1 codebook provides a small, nonuniform set of representable magnitudes up to ±6 and relies on saturating behavior rather than IEEE NaN/Inf encodings to maximize usable range per bit. Using an FP8 block scale (rather than power-of-two-only E8M0) enables fractional scales and error-minimizing scale selection strategies such as dual-pass evaluation comparing "map max to 6" versus "map max to 4 with clipping." On Blackwell Tensor Cores, native FP4 multipliers exploit E2M1 simplicity to reduce multiplier area while higher-precision FP32 accumulation protects dot-product accuracy.
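
The scheme described above can be sketched in a few lines of pure Python. This is an illustrative simulation, not the TensorRT Model Optimizer implementation: the helper names (`round_to_e4m3`, `quantize_block`, `quantize_block_best`) are hypothetical, the per-tensor FP32 scale that NVFP4 tensors also carry is omitted, and squared error is assumed as the criterion for choosing between the two candidate scales.

```python
import math

# Non-negative magnitudes representable in E2M1 (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # elements sharing one FP8 (E4M3) scale

# Storage cost per element: 4-bit code plus an 8-bit scale amortized over the block.
BITS_PER_ELEMENT = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE


def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest FP8 E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    exp = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits per binade
    return min(round(x / step) * step, 448.0)


def quantize_block(block, target_max=6.0):
    """Quantize one block so that its largest magnitude maps near target_max."""
    amax = max(abs(v) for v in block)
    scale = round_to_e4m3(amax / target_max)
    if scale == 0.0:
        return 0.0, [0.0] * len(block)
    codes = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # saturate instead of producing Inf/NaN
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and E2M1 codes."""
    return [scale * c for c in codes]


def quantize_block_best(block):
    """Try both targets (max -> 6 and max -> 4) and keep the lower squared error."""
    best = None
    for target in (6.0, 4.0):
        scale, codes = quantize_block(block, target)
        err = sum((v - scale * c) ** 2 for v, c in zip(block, codes))
        if best is None or err < best[0]:
            best = (err, scale, codes)
    return best[1], best[2]
```

Calling `quantize_block_best` on each 16-element slice of a weight row and then `dequantize_block` on the result gives a quick way to estimate the reconstruction error a given tensor would see under this scheme.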
Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
Credits to [AxionML](https://huggingface.co/AxionML) for quantizing this model.
## Qwen3.5 Highlights
Qwen3.5 features the following enhancements:
- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- **Scalable RL Generalization**: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- **Global Linguistic Coverage**: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
- **Next-Generation Training Infrastructure**: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.
For more details, please refer to our blog post [Qwen3.5](https://qwen.ai/blog?id=qwen3.5).
## Model Overview
- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model
    - Number of Parameters: 2B
    - Hidden Dimension: 2048
    - Token Embedding: 248320 (Padded)
    - Number of Layers: 24
    - Hidden Layout: 6 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
    - Gated DeltaNet:
        - Number of Linear Attention Heads: 16 for V and 16 for QK
        - Head Dimension: 128
    - Gated Attention:
        - Number of Attention Heads: 8 for Q and 2 for KV
        - Head Dimension: 256
        - Rotary Position Embedding Dimension: 64
    - Feed Forward Network:
        - Intermediate Dimension: 6144
    - LM Output: 248320 (Tied to token embedding)
    - MTP: trained with multi-steps
- Context Length: 262,144 natively
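
The hidden layout and attention dimensions listed above can be checked with a short calculation. The sketch below expands the 6 × (3 + 1) pattern into a per-layer schedule and estimates the key/value cache size at full context. It rests on two assumptions that the overview does not state: that only the Gated Attention layers keep a per-token KV cache (the Gated DeltaNet layers are assumed to hold a fixed-size recurrent state), and that the cache is stored in a 2-byte format such as BF16. The function names are hypothetical.

```python
def build_layer_schedule(groups=6, linear_per_group=3, full_per_group=1):
    """Expand the hidden layout into an ordered list of token-mixer types."""
    schedule = []
    for _ in range(groups):
        schedule += ["gated_deltanet"] * linear_per_group
        schedule += ["gated_attention"] * full_per_group
    return schedule


def kv_cache_bytes(tokens, attn_layers, kv_heads=2, head_dim=256, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every full-attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_elem


if __name__ == "__main__":
    schedule = build_layer_schedule()
    attn_layers = schedule.count("gated_attention")
    print(len(schedule), "layers,", attn_layers, "with full attention")
    size = kv_cache_bytes(tokens=262_144, attn_layers=attn_layers)
    print(f"KV cache at 262,144 tokens: {size / 2**30:.1f} GiB per sequence")
```

Under these assumptions, a single sequence at the native context length needs about 3 GiB of KV cache, because only 6 of the 24 layers contribute to it.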
## Benchmark Results
### Language
| | Qwen3.5-2B | Qwen3.5-2B-NVFP4 |
|---|---|---|
| Instruct (Non-Thinking) Mode | ||
| MMLU-Pro | 55.3 | 54.5 |
| MMLU-Redux | 69.2 | 67.8 |
| C-Eval | 65.2 | 63.6 |
| SuperGPQA | 30.4 | 30.1 |
| IFEval | 61.2 | 59.5 |
| MMMLU | 56.9 | 55.4 |
| Knowledge & STEM (Thinking) | ||
| MMLU-Pro | 66.5 | 65.3 |
| MMLU-Redux | 79.6 | 77.6 |
| C-Eval | 73.2 | 72.2 |
| SuperGPQA | 37.5 | 36.8 |
| GPQA | 51.6 | 50.7 |
| Instruction Following (Thinking) | ||
| IFEval | 78.6 | 77.2 |
| IFBench | 41.3 | 40.8 |
| MultiChallenge | 33.7 | 33.2 |
| Long Context (Thinking) | ||
| AA-LCR | 25.6 | 25.2 |
| LongBench v2 | 38.7 | 38.1 |
| Reasoning (Thinking) | ||
| HMMT Feb 25 | 22.9 | 22.6 |
| HMMT Nov 25 | 19.6 | 19.4 |
| General Agent (Thinking) | ||
| BFCL-V4 | 43.6 | 42.8 |
| TAU2-Bench | 48.8 | 48.1 |
| Multilingualism (Thinking) | ||
| MMMLU | 63.1 | 61.9 |
| MMLU-ProX | 52.3 | 51.3 |
| NOVA-63 | 46.4 | 45.6 |
| INCLUDE | 55.4 | 54.0 |
| Global PIQA | 69.3 | 66.7 |
| PolyMATH | 26.1 | 25.2 |
| WMT24++ | 45.8 | 44.9 |
| MAXIFE | 60.6 | 59.5 |
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MMLU-ProX: we report the averaged accuracy on 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
* Experimental settings: top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0 were used.
* Empty cells (--) indicate scores not yet available or not applicable.
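
One common way to summarize the cost of quantization is accuracy recovery, the quantized score expressed as a percentage of the baseline score. The sketch below applies that arithmetic to a handful of thinking-mode rows copied from the table above. The selection of rows and the function names are illustrative choices, not part of the original evaluation.

```python
# (Qwen3.5-2B score, Qwen3.5-2B-NVFP4 score) pairs copied from the thinking-mode rows.
THINKING_SCORES = {
    "MMLU-Pro": (66.5, 65.3),
    "GPQA": (51.6, 50.7),
    "IFEval": (78.6, 77.2),
    "LongBench v2": (38.7, 38.1),
    "BFCL-V4": (43.6, 42.8),
}


def recovery_percent(base, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / base


def mean_recovery(scores):
    """Unweighted mean recovery across the given benchmarks."""
    values = [recovery_percent(base, quant) for base, quant in scores.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, (base, quant) in THINKING_SCORES.items():
        print(f"{name:>14}: {recovery_percent(base, quant):.1f}% of baseline")
    print(f"{'mean':>14}: {mean_recovery(THINKING_SCORES):.1f}%")
```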
### Vision Language
| | Qwen3.5-2B | Qwen3.5-2B-NVFP4 |
|---|---|---|
| STEM and Puzzle | ||
| MMMU | 64.2/64.2 | 64.2/64.2 |
| MMMU-Pro | 50.3/47.7 | 50.3/47.7 |
| Mathvista(mini) | 76.7/73.9 | 76.7/73.9 |
| DynaMath | 73.6/69.6 | 73.6/69.6 |
| ZEROBench | 1/0 | 1/0 |
| ZEROBench_sub | 17.1/18.6 | 17.1/18.6 |
| VlmsAreBlind | 75.8/74.3 | 75.8/74.3 |
| General VQA | ||
| RealWorldQA | 74.5/71.2 | 74.5/71.2 |
| MMStar | 71.7/68.0 | 71.7/68.0 |
| MMBenchEN-DEV-v1.1 | 83.3/81.3 | 83.3/81.3 |
| SimpleVQA | 38.5/39.5 | 38.5/39.5 |
| HallusionBench | 58.0/51.3 | 58.0/51.3 |
| Text Recognition and Document Understanding | ||
| MMLongBench-Doc | 45.4/38.8 | 45.4/38.8 |
| AI2D_TEST | 83.3/81.5 | 83.3/81.5 |
| CC-OCR | 72.9/75.8 | 72.9/75.8 |
| OmniDocBench1.5 | 79.8/80.9 | 79.8/80.9 |
| CharXiv(RQ) | 58.8/52.6 | 58.8/52.6 |
| OCRBench | 84.5/85.4 | 84.5/85.4 |
| Spatial Intelligence | ||
| RefCOCO(avg) | 84.8/84.3 | 84.8/84.3 |
| CountBench | 91.4/86.8 | 91.4/86.8 |
| ODInW13 | 35.9/40.5 | 35.9/40.5 |
| ERQA | 43.8/33.0 | 43.8/33.0 |
| EmbSpatialBench | 77.9/66.4 | 77.9/66.4 |
| RefSpatialBench | 32.9/30.0 | 32.9/30.0 |
| Hypersim | 12.4/12.4 | 12.4/12.4 |
| SUNRGBD | 28.7/25.6 | 28.7/25.6 |
| Nuscene | 6.9/8.5 | 6.9/8.5 |
| Video Understanding | ||
| VideoMME(w sub.) | 75.6/-- | 75.6/-- |
| VideoMME(w/o sub.) | 69.0/-- | 69.0/-- |
| VideoMMMU | 62.1/-- | 62.1/-- |
| MLVU | 76.2/-- | 76.2/-- |
| MVBench | 64.9/-- | 64.9/-- |
| LVBench | 57.1/-- | 57.1/-- |
| MMVU | 48.6/-- | 48.6/-- |
| Visual Agent | ||
| ScreenSpot Pro | --/54.5 | --/54.5 |
| Medical VQA | ||
| SLAKE | 74.4/67.5 | 74.4/67.5 |
| PMC-VQA | 48.8/54.0 | 48.8/54.0 |
| MedXpertQA-MM | 26.9/19.1 | 26.9/19.1 |
* Scores of Qwen3.5 models are reported as Thinking / Non-thinking.
* MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
* Experimental settings: For the video benchmarks, we used top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0. All other benchmarks used the same hyperparameters but with temperature=0.6 in thinking mode. In non-thinking mode, the inference hyperparameters were set to top_p=0.8, top_k=20, presence_penalty=1.5, and temperature=0.7.
* Empty cells (--) indicate scores not yet available or not applicable.
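
The sampling settings in the notes above can be collected into presets for reuse when reproducing the vision-language evaluations. This is a minimal sketch: the preset names and the `sampling_params` helper are hypothetical, and the key names follow a common convention that may differ from what a particular inference engine expects, so check them against the engine's own documentation.

```python
# Values taken from the experimental-settings note for the vision-language benchmarks.
SAMPLING_PRESETS = {
    "thinking": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5,
    },
    "thinking_video": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5,
    },
    "non_thinking": {
        "temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5,
    },
}


def sampling_params(thinking=True, video=False):
    """Return a copy of the preset matching the requested evaluation mode."""
    if not thinking:
        key = "non_thinking"
    elif video:
        key = "thinking_video"
    else:
        key = "thinking"
    return dict(SAMPLING_PRESETS[key])


if __name__ == "__main__":
    print(sampling_params(thinking=True))
    print(sampling_params(thinking=True, video=True))
    print(sampling_params(thinking=False))
```

Returning a copy keeps the shared presets unchanged if a caller overrides a single field for one request.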