---
language:
- en
- zh
- it
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
tags:
- nvfp4
- quantized
- fp4
- moe
- mixture-of-experts
- vllm
- blackwell
- qwen3
- nvidia
- compressed-tensors
pipeline_tag: text-generation
model_type: qwen3_next
quantized_by: Sophia-AI
---

# ⚡ Qwen3-Next-80B-A3B-Instruct-NVFP4

> **NVFP4 quantization of [Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct): 160 GB → 44.6 GB, ready for single-GPU deployment.**

A high-quality NVFP4 (NVIDIA FP4) quantization of Qwen's flagship Mixture-of-Experts model, calibrated on Italian-language data with full expert coverage. Designed for production inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA Blackwell, Hopper, and Ada GPUs.

---

## 🏗️ Model Overview

| | |
|---|---|
| 🧬 **Architecture** | Qwen3-Next: MoE with DeltaNet (linear attention) + standard attention |
| 📐 **Parameters** | 80B total, 3B active per token (512 experts, top-10 routing) |
| 🗜️ **Quantization** | NVFP4 (4-bit floating point) with FP8 KV cache |
| 📦 **Size** | 44.6 GB (from 160 GB BF16), a **72% reduction** |
| 🔧 **Format** | `compressed-tensors`, with native vLLM support |

---

## 🚀 Quick Start

### vLLM (recommended)

```bash
vllm serve Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --kv-cache-dtype fp8
```

### vLLM with Docker

```bash
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --kv-cache-dtype fp8
```

### Python (OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

### Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is DeltaNet and how does it differ from standard attention?"},
]

# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

---

## 🔬 Quantization Details

### Method

NVFP4 quantization using [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.9.0 with the `compressed-tensors` format. Weights are quantized to NVFP4, a 4-bit floating-point format (E2M1) in which each 16-value block shares an FP8 scale and each tensor carries an additional global scale. The KV cache is quantized to FP8 for additional memory savings during inference.
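To make the block-scaling idea concrete, here is a minimal pure-Python sketch of NVFP4-style "fake quantization" (quantize, then dequantize). It is an illustration under simplifying assumptions, not llmcompressor's implementation: the block scale is kept as an ordinary Python float rather than an FP8 value, the per-tensor global scale is omitted, and the names `quantize_block`, `dequantize_block`, and `fake_quantize` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Assumption: the scale stays a Python float; real NVFP4 stores it in FP8.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude, then restore the sign.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and its codes."""
    return [scale * c for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize a flat list of weights, block by block."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [(-1) ** i * (i + 1) / 64 for i in range(32)]
    restored = fake_quantize(weights)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"max abs error: {worst:.4f}")
```

Because the grid is coarsest between 4 and 6, the worst-case rounding error within a block equals one block scale, so a single outlier degrades only the 16 values in its block. With one 8-bit scale per 16 four-bit values, storage works out to roughly 4.5 bits per quantized weight.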

### Calibration

| | |
|---|---|
| 📊 **Samples** | 512 |
| 📏 **Sequence length** | 1024 tokens |
| 🌍 **Calibration language** | Italian |
| 🔀 **MoE coverage** | All 512 experts calibrated (`moe_calibrate_all_experts=True`) |
| ⚙️ **Pipeline** | Basic (full GPU, no CPU offload) |
| 🖥️ **Hardware** | 2× NVIDIA B200 SXM (358 GB VRAM) |
| ⏱️ **Total time** | ~4 hours |

### Preserved Layers (not quantized)

The following layers are kept in their original precision to preserve model quality:

| Pattern | Reason |
|---|---|
| `lm_head` | Output projection: critical for token prediction |
| `mlp.gate` | MoE routing gates: low parameter count, high impact |
| `mlp.shared_expert_gate` | Shared expert gating: controls expert selection |
| `linear_attn.*` | DeltaNet layers: specialized linear attention mechanism |
| `self_attn.q_proj` | Query projection on standard attention layers |
| `self_attn.k_proj` | Key projection on standard attention layers |
| `self_attn.v_proj` | Value projection on standard attention layers |

> These exclusions follow NVIDIA's official quantization configuration for this architecture. A total of **385 modules** are preserved in original precision.

---

## 💻 Hardware Requirements

| Setup | VRAM for weights | Notes |
|---|---|---|
| 1× B200 (192 GB) | ~45 GB | ✅ Recommended: plenty of headroom for KV cache |
| 1× H200 (141 GB) | ~45 GB | ✅ Works well |
| 1× A100 (80 GB) | ~45 GB | ✅ Works; monitor KV cache with long contexts |
| 1× H100 (80 GB) | ~45 GB | ✅ Works; same caveat as A100 |
| 1× RTX 4090 (24 GB) | ~45 GB | ❌ Insufficient VRAM |

> The FP8 KV cache (`--kv-cache-dtype fp8`) is recommended for all deployments to maximize context length within available VRAM.

---

## 🏛️ Architecture Notes

Qwen3-Next introduces a **hybrid attention architecture** that interleaves two layer types:

- **DeltaNet (linear attention):** Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, ... (efficient linear-complexity attention)
- **Standard attention:** Layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47 (full quadratic attention, every 4th layer)

This hybrid design enables efficient long-context processing while maintaining the representational power of standard attention at regular intervals. The MoE routing activates 10 out of 512 experts per token, keeping inference compute at ~3B active parameters despite the 80B total.

---

## ⚠️ Important Notes

- 🎯 **Calibration language:** calibrated on Italian data. The model retains its full multilingual capabilities, but the quantization statistics may slightly favor Italian and similar Romance languages.
- 📏 **Sequence length:** calibrated at 1024 tokens. The model supports longer contexts, but the quantization statistics were collected on sequences of this length.
- 🔧 **vLLM recommended:** the `compressed-tensors` format is natively supported by vLLM. Other inference engines may require conversion.
- 📊 **Benchmarks:** coming soon. Community evaluations welcome.

---

## 📜 License

This model inherits the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license from the base model.

---

Quantized with ❤️ by Sophia AI
NVFP4 via llmcompressor • 512 experts fully calibrated • Ready for vLLM