---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-35B-A3B
license: apache-2.0
library_name: transformers
tags:
- AxionML
- ModelOpt
- Qwen3.5
- quantized
- NVFP4
- nvfp4
- sglang
---

# AxionML Qwen3.5-35B-A3B-NVFP4

> Developed by [AxionML](https://huggingface.co/AxionML) for open-source serving and deployment use cases. Part of AxionML's effort to provide ready-to-serve quantized models for the community.

This is an NVFP4-quantized version of [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (35B total parameters, 3B active), quantized using [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). Weights and activations of linear layers are quantized to FP4, reducing disk size and GPU memory by ~4x compared to BF16.

**About NVFP4 quantization:** NVFP4 on Blackwell couples a compact E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks, so that 4-bit stored values remain numerically useful for neural-network computation. The E2M1 codebook provides a small, nonuniform set of representable magnitudes up to ±6 and relies on saturating behavior rather than IEEE NaN/Inf encodings to maximize usable range per bit. Using an FP8 block scale (rather than power-of-two-only E8M0) enables fractional scales and error-minimizing scale selection strategies such as dual-pass evaluation comparing "map max to 6" versus "map max to 4 with clipping." On Blackwell Tensor Cores, native FP4 multipliers exploit E2M1 simplicity to reduce multiplier area while higher-precision FP32 accumulation protects dot-product accuracy.

> **Ready for commercial and non-commercial use under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).**

> [!Tip]
> For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by [Alibaba Cloud Model Studio](https://modelstudio.alibabacloud.com/).
>
> In particular, **Qwen3.5-Flash** is the hosted version corresponding to Qwen3.5-35B-A3B with more production features, e.g., 1M context length by default and official built-in tools.
> For more information, please refer to the [User Guide](https://www.alibabacloud.com/help/en/model-studio/text-generation).

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

## Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- **Scalable RL Generalization**: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- **Global Linguistic Coverage**: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
- **Next-Generation Training Infrastructure**: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

For more details, please refer to our blog post [Qwen3.5](https://qwen.ai/blog?id=qwen3.5).

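To make the NVFP4 description in the introduction of this card more concrete, here is a minimal pure-Python simulation of one 16-element block: it rounds a block scale to FP8 E4M3, snaps scaled values to the E2M1 grid with saturation at ±6, and tries both the "map max to 6" and "map max to 4" scales. This is only a sketch of one reading of that paragraph, not the code used by TensorRT Model Optimizer or by Blackwell hardware. All function names are hypothetical, tie-breaking is simplified, and any tensor-level scaling that a real pipeline may apply is left out.

```python
import math

# Representable E2M1 magnitudes (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # micro-block size stated in this card


def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    # frexp gives x = m * 2**e with 0.5 <= m < 1, so floor(log2(x)) == e - 1.
    exponent = max(math.frexp(x)[1] - 1, -6)  # -6: smallest normal exponent
    step = 2.0 ** (exponent - 3)              # 3 mantissa bits
    return min(round(x / step) * step, 448.0)


def quantize_e2m1(value: float) -> float:
    """Snap a scaled value to the nearest E2M1 code, saturating at +/-6.

    Ties go to the smaller magnitude here, which is a simplification.
    """
    magnitude = min(abs(value), 6.0)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
    return math.copysign(nearest, value)


def fake_quantize_block(block, target_max):
    """Quantize then dequantize one block, mapping its max magnitude to target_max."""
    amax = max(abs(v) for v in block)
    scale = round_to_e4m3(amax / target_max)
    if scale == 0.0:  # all-zero block, or scale too small for E4M3
        return [0.0] * len(block), 0.0
    return [quantize_e2m1(v / scale) * scale for v in block], scale


def squared_error(block, dequantized):
    return sum((a - b) ** 2 for a, b in zip(block, dequantized))


def nvfp4_dual_pass(block):
    """Try 'max -> 6' and 'max -> 4'; keep the candidate with lower squared error."""
    candidates = [fake_quantize_block(block, t) for t in (6.0, 4.0)]
    return min(candidates, key=lambda c: squared_error(block, c[0]))


if __name__ == "__main__":
    block = [4.0, 3.0, 2.0, 1.0] + [0.0] * (BLOCK_SIZE - 4)
    dequantized, scale = nvfp4_dual_pass(block)
    print(scale, dequantized[:4])
```

With one 8-bit scale stored per sixteen 4-bit values, a quantized tensor averages 4.5 bits per element, so the per-tensor saving relative to 16-bit BF16 is somewhat below an exact 4x.
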
## Model Overview

- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model
    - Number of Parameters: 35B in total and 3B activated
    - Hidden Dimension: 2048
    - Token Embedding: 248320 (Padded)
    - Number of Layers: 40
    - Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
    - Gated DeltaNet:
        - Number of Linear Attention Heads: 32 for V and 16 for QK
        - Head Dimension: 128
    - Gated Attention:
        - Number of Attention Heads: 16 for Q and 2 for KV
        - Head Dimension: 256
        - Rotary Position Embedding Dimension: 64
    - Mixture Of Experts
        - Number of Experts: 256
        - Number of Activated Experts: 8 Routed + 1 Shared
        - Expert Intermediate Dimension: 512
    - LM Output: 248320 (Padded)
    - MTP: trained with multi-steps
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens.

## Benchmark Results

### Language

| Benchmark | Qwen3.5-35B-A3B | Qwen3.5-35B-A3B-NVFP4 |
|---|---|---|
| **Knowledge** | | |
| MMLU-Pro | 85.3 | 84.0 |
| MMLU-Redux | 93.3 | 91.4 |
| C-Eval | 90.2 | 87.9 |
| SuperGPQA | 63.4 | 62.8 |
| **Instruction Following** | | |
| IFEval | 91.9 | 89.4 |
| IFBench | 70.2 | 68.4 |
| MultiChallenge | 60.0 | 58.9 |
| **Long Context** | | |
| AA-LCR | 58.5 | 57.0 |
| LongBench v2 | 59.0 | 58.2 |
| **STEM & Reasoning** | | |
| HLE w/ CoT | 22.4 | 22.0 |
| GPQA Diamond | 84.2 | 82.7 |
| HMMT Feb 25 | 89.0 | 87.4 |
| HMMT Nov 25 | 89.2 | 88.0 |
| **Coding** | | |
| SWE-bench Verified | 69.2 | 68.2 |
| Terminal Bench 2 | 40.5 | 39.8 |
| LiveCodeBench v6 | 74.6 | 73.5 |
| CodeForces | 2028 | 2002.4 |
| OJBench | 36.0 | 35.5 |
| FullStackBench en | 58.1 | 57.1 |
| FullStackBench zh | 55.0 | 54.2 |
| **General Agent** | | |
| BFCL-V4 | 67.3 | 66.1 |
| TAU2-Bench | 81.2 | 79.6 |
| VITA-Bench | 31.9 | 31.3 |
| DeepPlanning | 22.8 | 22.2 |
| **Search Agent** | | |
| HLE w/ tool | 47.4 | 45.7 |
| Browsecomp | 61.0 | 58.8 |
| Browsecomp-zh | 69.5 | 68.1 |
| WideSearch | 57.1 | 56.1 |
| Seal-0 | 41.4 | 40.5 |
| **Multilingualism** | | |
| MMMLU | 85.2 | 82.9 |
| MMLU-ProX | 81.0 | 78.0 |
| NOVA-63 | 57.1 | 55.0 |
| INCLUDE | 79.7 | 77.9 |
| Global PIQA | 86.6 | 85.5 |
| PolyMATH | 64.4 | 63.0 |
| WMT24++ | 76.3 | 74.0 |
| MAXIFE | 86.6 | 85.6 |
* CodeForces: evaluated on our own query set.
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
* WideSearch: we use a 256k context window without any context management.
* MMLU-ProX: we report the average accuracy across 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the average score across 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
* Empty cells (--) indicate scores not yet available or not applicable.
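
The context-folding strategy mentioned in the Search Agent note can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the evaluation harness itself: the `Message` type, the placeholder text, the `fold_context` name, and the choice to always keep the newest tool response are all hypothetical, and `len` on characters stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # e.g. "user", "assistant", "tool"
    content: str


# Hypothetical marker left in place of a pruned tool response.
PRUNED_PLACEHOLDER = "[tool response pruned]"


def fold_context(history, threshold, count_tokens=len):
    """Return a copy of `history` with the oldest tool responses pruned.

    Pruning starts once the cumulative length of unpruned tool responses
    reaches `threshold` and continues, oldest first, until it drops below
    the threshold. The newest tool response is always kept.
    """
    folded = list(history)
    tool_indices = [i for i, m in enumerate(folded) if m.role == "tool"]
    total = sum(count_tokens(folded[i].content) for i in tool_indices)
    for i in tool_indices[:-1]:
        if total < threshold:
            break
        total -= count_tokens(folded[i].content)
        folded[i] = Message("tool", PRUNED_PLACEHOLDER)
    return folded


if __name__ == "__main__":
    history = [
        Message("user", "Find the release date."),
        Message("tool", "a" * 100),
        Message("assistant", "Searching again."),
        Message("tool", "b" * 100),
        Message("tool", "c" * 50),
    ]
    for message in fold_context(history, threshold=200):
        print(message.role, len(message.content))
```

Replacing a pruned response with a short placeholder, rather than deleting the message, keeps the turn structure intact; whether the original setup does this is not stated in this card.
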

### Vision Language

| Benchmark | Qwen3.5-35B-A3B | Qwen3.5-35B-A3B-NVFP4 |
|---|---|---|
| **STEM and Puzzle** | | |
| MMMU | 81.4 | 79.2 |
| MMMU-Pro | 75.1 | 73.1 |
| MathVision | 83.9 | 82.0 |
| MathVista (mini) | 86.2 | 85.1 |
| DynaMath | 85.0 | 83.4 |
| ZEROBench | 8 | 7.8 |
| ZEROBench_sub | 34.1 | 33.2 |
| VlmsAreBlind | 97.0 | 95.4 |
| BabyVision | 38.4 / 29.6 | 38.4 / 29.6 |
| **General VQA** | | |
| RealWorldQA | 84.1 | 81.1 |
| MMStar | 81.9 | 79.4 |
| MMBenchEN-DEV-v1.1 | 91.5 | 90.1 |
| SimpleVQA | 58.3 | 56.9 |
| HallusionBench | 67.9 | 65.9 |
| **Text Recognition and Document Understanding** | | |
| OmniDocBench1.5 | 89.3 | 86.8 |
| CharXiv(RQ) | 77.5 | 75.2 |
| MMLongBench-Doc | 59.5 | 58.0 |
| CC-OCR | 80.7 | 78.3 |
| AI2D_TEST | 92.6 | 90.0 |
| OCRBench | 91.0 | 89.6 |
| **Spatial Intelligence** | | |
| ERQA | 64.8 | 63.8 |
| CountBench | 97.8 | 96.5 |
| RefCOCO(avg) | 89.2 | 87.2 |
| ODInW13 | 42.6 | 41.4 |
| EmbSpatialBench | 83.1 | 81.6 |
| RefSpatialBench | 63.5 | 62.5 |
| LingoQA | 79.2 | 77.6 |
| Hypersim | 13.1 | 12.6 |
| SUNRGBD | 33.4 | 33.0 |
| Nuscene | 14.6 | 14.3 |
| **Video Understanding** | | |
| VideoMME(w sub.) | 86.6 | 84.5 |
| VideoMME(w/o sub.) | 82.5 | 81.1 |
| VideoMMMU | 80.4 | 77.4 |
| MLVU | 85.6 | 84.2 |
| MVBench | 74.8 | 73.5 |
| LVBench | 71.4 | 69.9 |
| MMVU | 72.3 | 71.3 |
| **Visual Agent** | | |
| ScreenSpot Pro | 68.6 | 67.1 |
| OSWorld-Verified | 54.5 | 53.2 |
| AndroidWorld | 71.1 | 69.7 |
| **Tool Calling** | | |
| TIR-Bench | 55.5 / 38.0 | 55.5 / 38.0 |
| V* | 92.7 / 89.5 | 92.7 / 89.5 |
| **Medical VQA** | | |
| SLAKE | 78.7 | 77.3 |
| PMC-VQA | 62.0 | 60.9 |
| MedXpertQA-MM | 61.4 | 60.2 |
* MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within `\boxed{}`.” For other models, we report the higher score between runs with and without the `\boxed{}` formatting.
* BabyVision: scores reported as "with CI / without CI".
* TIR-Bench and V*: scores reported as "with CI / without CI".
* Empty cells (--) indicate scores not yet available or not applicable.
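
For readers who want a single number per benchmark, the two score columns can be compared as a retention ratio (NVFP4 score divided by baseline score). The sketch below uses a handful of score pairs copied from the tables in this card. The helper names are hypothetical, and the unweighted mean is only one possible summary, not a metric reported by the authors.

```python
# A few (baseline, NVFP4) score pairs copied from the tables in this card.
SCORES = {
    "MMLU-Pro": (85.3, 84.0),
    "GPQA Diamond": (84.2, 82.7),
    "LiveCodeBench v6": (74.6, 73.5),
    "MMMU": (81.4, 79.2),
    "OCRBench": (91.0, 89.6),
}


def retention(baseline: float, quantized: float) -> float:
    """Fraction of the baseline score kept after quantization."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return quantized / baseline


def summarize(scores):
    """Return per-benchmark retention and the unweighted mean retention."""
    per_benchmark = {name: retention(b, q) for name, (b, q) in scores.items()}
    mean = sum(per_benchmark.values()) / len(per_benchmark)
    return per_benchmark, mean


if __name__ == "__main__":
    per_benchmark, mean = summarize(SCORES)
    for name, value in per_benchmark.items():
        print(f"{name:<18} {value:.1%}")
    print(f"{'mean':<18} {mean:.1%}")
```

Ratios are only meaningful for scores on a bounded accuracy-style scale; rating-style metrics such as the CodeForces score are better compared as absolute differences.
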