---
language:
  - en
  - zh
  - it
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
tags:
  - nvfp4
  - quantized
  - fp4
  - moe
  - mixture-of-experts
  - vllm
  - blackwell
  - qwen3
  - nvidia
  - compressed-tensors
pipeline_tag: text-generation
model_type: qwen3_next
quantized_by: Sophia-AI
---

# ⚡ Qwen3-Next-80B-A3B-Instruct-NVFP4

> **NVFP4 quantization of [Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct): 160 GB → 44.6 GB, ready for single-GPU deployment.**

An NVFP4 (NVIDIA FP4) quantization of Qwen's flagship Mixture-of-Experts model, calibrated on Italian-language data with full expert coverage. Designed for production inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA Blackwell, Hopper, and Ada GPUs. Only Blackwell has native FP4 tensor cores, so other architectures depend on the fallback kernels available in your vLLM version.

---

## 🏗️ Model Overview

| | |
|---|---|
| 🧬 **Architecture** | Qwen3-Next — MoE with DeltaNet (linear attention) + standard attention |
| 📐 **Parameters** | 80B total, 3B active per token (512 experts, top-10 routing) |
| 🗜️ **Quantization** | NVFP4 (4-bit floating point) with FP8 KV cache |
| 📦 **Size** | 44.6 GB (from 160 GB BF16) — **72% reduction** |
| 🔧 **Format** | `compressed-tensors` — native vLLM support |

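The size figures in the table can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal gigabytes (1 GB = 1e9 bytes) and a round 80B parameter count; the helper names are illustrative. The average comes out above 4 bits per parameter because block scales and the unquantized layers add overhead.

```python
BF16_GB = 160.0      # BF16 checkpoint size, from the table above
NVFP4_GB = 44.6      # quantized checkpoint size, from the table above
TOTAL_PARAMS = 80e9  # assumption: a round 80B parameters


def reduction_pct(before_gb: float, after_gb: float) -> float:
    """Percentage of storage saved going from before_gb to after_gb."""
    return 100.0 * (1.0 - after_gb / before_gb)


def avg_bits_per_param(size_gb: float, n_params: float) -> float:
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / n_params


print(f"reduction: {reduction_pct(BF16_GB, NVFP4_GB):.1f}%")
print(f"average bits per parameter: {avg_bits_per_param(NVFP4_GB, TOTAL_PARAMS):.2f}")
```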
---

## 🚀 Quick Start

### vLLM (recommended)

```bash
vllm serve Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8
```

### vLLM with Docker

```bash
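# --ipc=host gives PyTorch the shared memory it needs; the volume mount
# reuses the host's Hugging Face cache so weights are not re-downloaded.
docker run --gpus all \
    --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8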
```

### Python (OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
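
Before sending chat requests, it can help to confirm which model name the server is exposing. The sketch below uses only the standard library and assumes the server from the previous steps is listening on `localhost:8000`. `model_ids` and `fetch_model_ids` are illustrative helper names, and the sample payload is hand-written to mirror the shape of an OpenAI-style `/v1/models` response.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1",
                    timeout: float = 5.0) -> list[str]:
    """Query a running server; raises URLError if nothing is listening."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Hand-written payload in the shape the endpoint is expected to return.
sample = {
    "object": "list",
    "data": [{"id": "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4", "object": "model"}],
}
print(model_ids(sample))
```

With a server running, `fetch_model_ids()` should return a list containing the name passed as `model` in the chat request.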

### Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is DeltaNet and how does it differ from standard attention?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
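outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))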
```

---

## 🔬 Quantization Details

### Method

The model was quantized to NVFP4 with [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.9.0 and saved in the `compressed-tensors` format. Weights are stored as 4-bit NVIDIA floating-point values, with an FP8 scale for each 16-element block and a per-tensor global scale. The KV cache is quantized to FP8 for additional memory savings during inference.

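As a rough illustration of what block-scaled 4-bit floating point does to a weight vector, the sketch below round-trips values through the E2M1 magnitude grid using one shared scale per 16-element block. It is a simplified model and not llmcompressor's implementation: the block scale is kept as a Python float, whereas NVFP4 stores it in FP8 and applies a per-tensor scale on top. The function names are hypothetical.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float; the sign bit is separate.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # number of consecutive values that share one scale


def fake_quant_block(block):
    """Quantize and dequantize one block of floats."""
    amax = max(abs(x) for x in block)
    # Map the block's largest magnitude onto 6.0, the top of the grid.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest grid point.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def fake_quant_nvfp4(weights):
    """Apply fake_quant_block to each 16-element block of a flat list."""
    if len(weights) % BLOCK:
        raise ValueError("length must be a multiple of the block size")
    out = []
    for start in range(0, len(weights), BLOCK):
        out.extend(fake_quant_block(weights[start:start + BLOCK]))
    return out


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(64)]
w_q = fake_quant_nvfp4(w)
print("max abs error:", max(abs(a - b) for a, b in zip(w, w_q)))
```

Because the widest gap in the grid is between 4 and 6, the round-trip error within a block is at most one sixth of that block's largest magnitude in this simplified model.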
### Calibration

| | |
|---|---|
| 📊 **Samples** | 512 |
| 📏 **Sequence length** | 1024 tokens |
| 🌍 **Calibration language** | Italian |
| 🔀 **MoE coverage** | All 512 experts calibrated (`moe_calibrate_all_experts=True`) |
| ⚙️ **Pipeline** | Basic (full GPU, no CPU offload) |
| 🖥️ **Hardware** | 2× NVIDIA B200 SXM (358 GB VRAM) |
| ⏱️ **Total time** | ~4 hours |

### Preserved Layers (not quantized)

The following layers are kept in their original precision to preserve model quality:

| Pattern | Reason |
|---|---|
| `lm_head` | Output projection — critical for token prediction |
| `mlp.gate` | MoE routing gates — low parameter count, high impact |
| `mlp.shared_expert_gate` | Shared expert gate, which scales the shared expert's contribution to each token |
| `linear_attn.*` | DeltaNet layers — specialized linear attention mechanism |
| `self_attn.q_proj` | Query projection on standard attention layers |
| `self_attn.k_proj` | Key projection on standard attention layers |
| `self_attn.v_proj` | Value projection on standard attention layers |

> These exclusions follow NVIDIA's official quantization configuration for this architecture. A total of **385 modules** are preserved in original precision.

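To see how the patterns in the table map onto module names, the sketch below rewrites them as regular expressions and checks a few names. Both the regex spellings and the example module paths are assumptions made for illustration; the exact `ignore` list used in the quantization recipe is not reproduced in this card.

```python
import re

# Table patterns rewritten as regular expressions (illustrative spelling).
PRESERVED_PATTERNS = [
    r"lm_head$",
    r"\.mlp\.gate$",
    r"\.mlp\.shared_expert_gate$",
    r"\.linear_attn\.",
    r"\.self_attn\.[qkv]_proj$",
]


def is_preserved(module_name: str) -> bool:
    """Return True if the module would stay in its original precision."""
    return any(re.search(pattern, module_name) for pattern in PRESERVED_PATTERNS)


# Hypothetical module paths, used only to exercise the patterns.
for name in (
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.5.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.self_attn.o_proj",
):
    print(f"{name}: {'preserved' if is_preserved(name) else 'quantized'}")
```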
---

## 💻 Hardware Requirements

| Setup | Weight footprint | Notes |
|---|---|---|
| 1× B200 (192 GB) | ~45 GB | ✅ Recommended — plenty of headroom for KV cache |
| 1× H200 (141 GB) | ~45 GB | ✅ Works well |
| 1× A100 (80 GB) | ~45 GB | ✅ Works — monitor KV cache with long contexts |
| 1× H100 (80 GB) | ~45 GB | ✅ Works — same as A100 |
| 1× RTX 4090 (24 GB) | ~45 GB | ❌ Insufficient VRAM |

> The FP8 KV cache (`--kv-cache-dtype fp8`) is recommended for all deployments to maximize context length within available VRAM.

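A back-of-the-envelope estimate shows how much the FP8 KV cache saves. Only the 12 standard attention layers listed under Architecture Notes keep a per-token KV cache. The defaults of 2 KV heads and a head dimension of 256 are assumptions that are not stated in this card, so check them against the model's `config.json`. The estimate also ignores the fixed-size DeltaNet state, any scale metadata, and allocator overhead in vLLM.

```python
def kv_cache_bytes(tokens: int,
                   attn_layers: int = 12,     # standard attention layers
                   kv_heads: int = 2,         # assumption, verify in config.json
                   head_dim: int = 256,       # assumption, verify in config.json
                   bytes_per_value: int = 1   # 1 for FP8, 2 for BF16
                   ) -> int:
    """Approximate KV cache size in bytes for a single sequence."""
    # The factor of 2 covers keys and values.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens


for ctx in (8_192, 65_536, 262_144):
    fp8_gib = kv_cache_bytes(ctx) / 2**30
    bf16_gib = kv_cache_bytes(ctx, bytes_per_value=2) / 2**30
    print(f"{ctx:>7} tokens: FP8 {fp8_gib:.2f} GiB, BF16 {bf16_gib:.2f} GiB")
```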
---

## 🏛️ Architecture Notes

Qwen3-Next introduces a **hybrid attention architecture** that interleaves two layer types in a repeating pattern of three DeltaNet layers followed by one standard attention layer:

- **DeltaNet (linear attention):** Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, ... — efficient linear-complexity attention
- **Standard attention:** Layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47 — full quadratic attention every 4th layer

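The layer pattern in the list above can be reproduced with a one-line rule. This sketch assumes 48 layers, indexed 0 to 47 as in the list; the type labels are illustrative strings, not names taken from the model's configuration.

```python
NUM_LAYERS = 48           # layers 0..47, per the list above
FULL_ATTENTION_EVERY = 4  # every 4th layer uses standard attention


def layer_type(index: int) -> str:
    """Classify a layer by its zero-based index."""
    if (index + 1) % FULL_ATTENTION_EVERY == 0:
        return "full_attention"
    return "linear_attention"


full = [i for i in range(NUM_LAYERS) if layer_type(i) == "full_attention"]
linear = [i for i in range(NUM_LAYERS) if layer_type(i) == "linear_attention"]
print("standard attention layers:", full)
print("DeltaNet layer count:", len(linear))
```
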
This hybrid design enables efficient long-context processing while maintaining the representational power of standard attention at regular intervals. The MoE routing activates 10 out of 512 experts per token, keeping inference compute at ~3B active parameters despite the 80B total.

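The routing step described above can be sketched as a softmax over router scores followed by a top-k selection. This is a generic top-k router written for illustration. Renormalizing the selected weights so they sum to 1 is an assumption, and the shared expert is left out.

```python
import math
import random


def route(logits, top_k=10):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    # Numerically stable softmax over all experts.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    # Assumption: renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(512)]  # one token, 512 experts
selected = route(router_logits)
print("active experts:", sorted(i for i, _ in selected))
print("fraction of experts used:", len(selected) / len(router_logits))
```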
---

## ⚠️ Important Notes

- 🎯 **Calibration language**: calibrated on Italian data only. The model remains multilingual, but quantization error may be somewhat lower on Italian and related Romance languages than on languages absent from the calibration set.
- 📏 **Sequence length**: calibrated at 1024 tokens. Longer contexts are supported, but the calibration statistics were collected only at this length.
- 🔧 **vLLM recommended**: the `compressed-tensors` format is natively supported by vLLM. Other inference engines may require conversion.
- 📊 **Benchmarks**: coming soon. Community evaluations welcome.

---

## 📜 License

This model inherits the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license from the base model.

---

<p align="center">
  Quantized with ❤️ by <a href="https://landing.2sophia.ai">Sophia AI</a><br>
  <em>NVFP4 via llmcompressor • 512 experts fully calibrated • Ready for vLLM</em>
</p>