+---
+language:
+- en
+library_name: vllm
+pipeline_tag: text-generation
+tags:
+  - text-generation
+  - conversational
+  - compressed-tensors
+  - awq
+  - w4a16
+  - w8a16
+  - quantized
+base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
+base_model_relation: quantized
+quantized_by: TheHouseOfTheDude
+license: apache-2.0
+---
+# Qwen3-Next-80B-A3B-Instruct — **Quantized** (compressed-tensors for vLLM)
+This repository provides **quantized runtime packages** of
+**[Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)**, repackaged for **vLLM** using the **compressed-tensors** format.
+> **TL;DR**
+> - **This repo is quantized** with branches **W4A16-ASYM** and **W8A16**.
+> - Load with **vLLM** using `--quantization compressed-tensors`.
+> - Qwen3‑Next **A3B** is an 80B‑parameter *hybrid MoE* model that **activates ~3B** params per token and supports **ultra‑long context (≈262K)**. Only a subset of experts is active at a time, but full weights still must be resident in GPU/CPU memory for fast inference.
+---
+## What’s special about **Qwen3‑Next** (A3B Instruct)
+- **Hybrid MoE / A3B**: 80B total params with ~**3B activated** at inference; experts are sparsely selected per token.
+- **Experts**: 512 experts with **top‑k = 10** routed experts activated per token, plus a shared expert for stability (figures from the parent model card).
+- **Context length**: native **262,144 tokens**; the parent model card describes extending beyond this with YaRN RoPE scaling.
+- **Instruction‑tuned** variant (this repo) – optimized for stable, formatted chat responses (no “thinking” traces).
+> See the parent model card and official posts for detailed specs and benchmarks.
+---
+## Revisions & Branches
+> The **`main`** branch is a **landing page** (model card + links). All runnable artifacts live under per‑revision branches.
+- **main** — placeholder / landing page
+- **W4A16-ASYM** — 4‑bit weights / 16‑bit activations builds and runtime assets
+- **W8A16** — 8‑bit weights / 16‑bit activations builds
+**Quick links:**
+- 🔗 **[`main`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/main)**
+- 🔗 **[`W4A16-ASYM`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/W4A16-ASYM)**
+- 🔗 **[`W8A16`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/W8A16)**
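Because `main` holds no weights, a download has to name a revision explicitly. A minimal sketch, assuming the `huggingface_hub` package is installed; `REPO_ID`, `BRANCHES`, `tree_url`, and `download_branch` are illustrative names introduced here, not part of this repo:

```python
from typing import Optional

REPO_ID = "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors"
BRANCHES = ("W4A16-ASYM", "W8A16")  # runnable branches listed above


def tree_url(branch: str) -> str:
    """Return the web URL of a branch's file listing."""
    if branch not in BRANCHES:
        raise ValueError(f"unknown branch {branch!r}; expected one of {BRANCHES}")
    return f"https://huggingface.co/{REPO_ID}/tree/{branch}"


def download_branch(branch: str, local_dir: Optional[str] = None) -> str:
    """Download one quantized branch and return the local snapshot path."""
    tree_url(branch)  # validates the branch name before any network call
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, revision=branch, local_dir=local_dir)
```

Calling `download_branch("W4A16-ASYM")` would fetch the 4-bit shards into the local Hugging Face cache; pass `local_dir` to place them elsewhere.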
+---
+## Repository Contents (per revision)
+- **Sharded quantized weights** in `.safetensors` with an index (`model.safetensors.index.json`)
+- `config.json` including **compressed‑tensors** metadata (`weight_format`, `quantization`, `quantization_config`)
+- Tokenizer artifacts (`tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, etc.)
+- Optional: `chat_template.jinja` (inherits the parent finetune’s chat format)
+> Exact files can differ by branch; see the **Files and versions** tab for each revision.
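To check what a given branch actually contains, the quantization metadata can be read straight from `config.json`. The sketch below assumes the usual compressed-tensors layout (`quantization_config` → `config_groups` → `weights` with `num_bits`, `symmetric`, `group_size`); the `SAMPLE` fragment is illustrative and not copied from this repo, and `summarize_quantization` is a hypothetical helper:

```python
import json


def summarize_quantization(config: dict) -> list:
    """Return one (group, bits, symmetric, group_size) tuple per config group."""
    qcfg = config.get("quantization_config", {})
    rows = []
    for name, group in sorted(qcfg.get("config_groups", {}).items()):
        weights = group.get("weights") or {}
        rows.append(
            (name, weights.get("num_bits"), weights.get("symmetric"), weights.get("group_size"))
        )
    return rows


# Illustrative config fragment, NOT copied from this repo's branches.
SAMPLE = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "type": "int", "symmetric": false,
                    "strategy": "group", "group_size": 128}
      }
    },
    "ignore": ["lm_head"]
  }
}
""")

for row in summarize_quantization(SAMPLE):
    print(row)
```

For a real branch, replace `SAMPLE` with `json.load(open("config.json"))` from the downloaded snapshot; if the keys differ from this assumed layout, the helper returns `None` fields rather than failing.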
+---
+## Quantization recipe & **Qwen3‑Next nuances** (what this export does)
+These builds were created with an **AWQ W4A16** / **W8A16** recipe using `llmcompressor` and a small **WikiText** calibration set. Important choices tailored to **Qwen3‑Next A3B**:
+- **Calibration data**: `wikitext-2-raw-v1` **validation** split; **64** samples, tokenized with the **chat template**; sequence length **1024**.
+- **Format & group size**: **weight‑only INT4** with **group_size=128** on the 4‑bit branch (the W8A16 branch stores INT8 weights); A16 means activations stay in the runtime dtype; non‑power‑of‑two channels handled.
+- **FFN policy**: **do NOT ignore** FFN projections (`gate_proj`, `up_proj`, `down_proj`) — they **are quantized**.
+- **MoE routing kept full‑precision**: router/dispatcher linears left **unquantized** (e.g., names including `router`, `expert_choice`, `dispatch`, `scores`, `route`, `topk`, `switch`) for stable expert selection.
+- **Head left unquantized**: `lm_head` remains in higher precision.
+- **MoE‑aware calibration**: `calibrate_moe_context=True` to properly calibrate sparse‑expert activations.
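+- **Symmetry**: the 4‑bit branch is named `W4A16-ASYM`, which indicates **asymmetric** INT4 weights (per‑group zero points); confirm the `symmetric` field under `quantization_config` in that branch's `config.json`. Activations are **BF16/FP16** at inference (A16).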
+- **Save**: exported with `save_compressed=True` to write **compressed‑tensors** metadata for vLLM.
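The router and head exclusions in the list above can be pictured as a substring filter over module names. This is a sketch only: the module names are hypothetical, `build_ignore_list` is an illustrative helper, and the actual export may have relied on llmcompressor's own ignore patterns instead.

```python
ROUTER_HINTS = ("router", "expert_choice", "dispatch", "scores", "route", "topk", "switch")


def build_ignore_list(module_names, hints=ROUTER_HINTS, always=("lm_head",)):
    """Return module names to leave unquantized: the head plus router-like linears."""
    ignored = []
    for name in module_names:
        lowered = name.lower()
        if name in always or any(hint in lowered for hint in hints):
            ignored.append(name)
    return ignored


# Hypothetical module names for illustration; real names depend on the model code.
names = [
    "lm_head",
    "model.layers.0.mlp.router",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.experts.0.down_proj",
]
print(build_ignore_list(names))
```

Substring matching only catches names that contain one of the hints, so a router registered under a different name would need an explicit entry in `always`.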
+> These design choices aim to preserve **router stability** and **FFN fidelity** in the A3B hybrid‑MoE layout while offering strong memory savings.
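To make "group_size=128" and the symmetric/asymmetric distinction concrete, here is a small pure-Python sketch of group-wise round-to-nearest quantization. It is not AWQ itself (AWQ additionally rescales channels using activation statistics before quantizing) and it is not the code used for this export; all function names are illustrative.

```python
def quantize_group(values, bits=4, symmetric=False):
    """Quantize one group of floats; return (codes, scale, zero_point)."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1            # 7 for INT4
        qmin = -qmax - 1                      # -8 for INT4
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs else 1.0
        zero_point = 0
    else:
        qmin, qmax = 0, 2 ** bits - 1         # 0..15 for INT4
        lo = min(min(values), 0.0)            # keep 0.0 representable
        hi = max(max(values), 0.0)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = round(-lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, **kwargs):
    """Split a weight row into groups, each with its own scale and zero point."""
    return [quantize_group(row[i:i + group_size], **kwargs)
            for i in range(0, len(row), group_size)]


def max_error(values, bits=4, symmetric=False):
    """Largest absolute reconstruction error for one group."""
    codes, scale, zp = quantize_group(values, bits=bits, symmetric=symmetric)
    return max(abs(a - b) for a, b in zip(values, dequantize_group(codes, scale, zp)))


skewed = [i / 10 for i in range(16)]  # all-positive toy group
print("asymmetric error:", max_error(skewed, symmetric=False))
print("symmetric error: ", max_error(skewed, symmetric=True))
```

On a one-sided group like `skewed`, the asymmetric variant uses all 16 levels for the occupied range, while the symmetric variant spends half of its levels on negative values that never occur.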
+---
+## Quickstart — vLLM (compressed‑tensors)
+Install vLLM (recent version recommended):
+```bash
+pip install vllm
+```
+Serve (adjust to your hardware):
+```bash
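+# Weights live on the W4A16-ASYM / W8A16 branches, not on main, so pin a revision.
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors \
+  --revision W4A16-ASYM \
+  --quantization compressed-tensors \
+  --tensor-parallel-size 8 \
+  --max-model-len 262144 \
+  --gpu-memory-utilization 0.70 \
+  --dtype bfloat16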
+```
+Query via **Chat Completions**:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors",
+    "messages": [
+      {"role":"system","content":"You are Qwen3-Next (A3B), helpful, precise, and safe."},
+      {"role":"user","content":"Outline a retrieval pipeline for scientific PDFs."}
+    ],
+    "max_tokens": 512,
+    "temperature": 0.7,
+    "top_p": 0.95
+  }'
+```
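The same request can be issued from Python with only the standard library. A minimal sketch, assuming the server from the previous step is listening on `localhost:8000`; `build_chat_request` and `chat` are illustrative helper names, and the response parsing assumes the OpenAI-compatible schema that vLLM exposes:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors"


def build_chat_request(user_text, system_text=None, max_tokens=512,
                       temperature=0.7, top_p=0.95):
    """Build (but do not send) a Chat Completions POST request."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    payload = {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Outline a retrieval pipeline for scientific PDFs.")` returns the reply as a string. Building the request separately from sending it makes the payload easy to inspect before any network call.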
+> **Note:** `compressed‑tensors` is a **vLLM runtime format**. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. For Transformers, use a different export (e.g., GPTQ/AWQ compatible) or full‑precision weights.
+---
+## Prompting / Chat Template
+This package follows the parent finetune’s **chat** conventions. If a `chat_template.jinja` is present, `apply_chat_template` will use it automatically.
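For orientation, Qwen chat models use a ChatML-style layout. The sketch below approximates what the template renders for simple text-only turns; `render_chatml` is a hypothetical helper, and the tokenizer's own template remains authoritative because it also covers cases (such as tool calls) that are not modeled here.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in ChatML style (text-only approximation)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Opens the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
```

When serving through the Chat Completions endpoint this rendering happens server-side, so the helper is mainly useful for debugging raw-prompt requests.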
+---
+## Lineage
+- **Finetuned parent:** [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
+- **This repo:** **Quantized child** of the finetune (**compressed‑tensors** for vLLM)
+---
+## Hardware & Tips (rule‑of‑thumb)
+- 80B‑class MoE with A3B still requires housing **all 80B weights** in GPU/CPU memory, though only ~**3B** are active per token.
+- Long contexts are **KV‑cache** heavy—tune `--max-model-len` and batch size.
+- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
+- Consider CUDA Graphs if stable in your stack.
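The rules of thumb above can be turned into rough arithmetic. This sketch treats all 80B parameters as quantized, which slightly understates the footprint because `lm_head` and the routers stay in higher precision, and it ignores activations and runtime overhead. The KV-cache arguments in the example call are placeholder values, not verified against this model's `config.json`; both function names are illustrative.

```python
GIB = 2 ** 30


def weight_memory_gib(n_params, bits, group_size=128, scale_bits=16, zero_point_bits=0):
    """Approximate weight storage in GiB for group-quantized weights."""
    # Each group of weights shares one scale (and optionally one zero point).
    overhead_bits = (scale_bits + zero_point_bits) / group_size
    return n_params * (bits + overhead_bits) / 8 / GIB


def kv_cache_gib(tokens, n_attn_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in GiB: keys plus values for each attention layer."""
    return 2 * tokens * n_attn_layers * n_kv_heads * head_dim * bytes_per_value / GIB


N_PARAMS = 80e9
print(f"BF16  weights: {weight_memory_gib(N_PARAMS, 16, scale_bits=0):.1f} GiB")
print(f"W8A16 weights: {weight_memory_gib(N_PARAMS, 8):.1f} GiB")
print(f"W4A16 weights: {weight_memory_gib(N_PARAMS, 4):.1f} GiB")

# Placeholder architecture values; read the real ones from config.json.
print(f"KV cache @262144 tokens: {kv_cache_gib(262144, 12, 2, 256):.1f} GiB per sequence")
```

The weight total is then divided across tensor-parallel ranks, and whatever remains under `--gpu-memory-utilization` is what vLLM can give to the KV cache.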
+---
+## License & Usage
+This distribution inherits the licenses/policies of the **finetuned parent** model (Apache‑2.0).
+Use of the model constitutes acceptance of the upstream terms.
+---
+## Changelog
+- **v1 (current)** — Quantized compressed‑tensors exports for Qwen3‑Next‑80B‑A3B‑Instruct; added **W4A16‑ASYM** and **W8A16** branches; model card set for **Quantized** classification.