phaedawg committed on
Commit dfc4ec3 · verified · 1 Parent(s): 7061b8b

First ReadME.md

  1. README.md +150 -0
README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ language:
+ - en
+ library_name: vllm
+ pipeline_tag: text-generation
+ tags:
+ - text-generation
+ - conversational
+ - compressed-tensors
+ - awq
+ - w4a16
+ - w8a16
+ - quantized
+ base_model: Qwen/Qwen3-Next-80B-A3B-Instruct
+ base_model_relation: quantized
+ quantized_by: TheHouseOfTheDude
+ license: apache-2.0
+ ---
+
+ # Qwen3-Next-80B-A3B-Instruct — **Quantized** (compressed-tensors for vLLM)
+
+ This repository provides **quantized runtime packages** of
+ **[Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)**, repackaged for **vLLM** using the **compressed-tensors** format.
+
+ > **TL;DR**
+ > - **This repo is quantized** with branches **W4A16-ASYM** and **W8A16**.
+ > - Load with **vLLM** using `--quantization compressed-tensors` and `--revision <branch>` (the `main` branch holds no weights).
+ > - Qwen3‑Next **A3B** is an 80B‑parameter *hybrid MoE* model that **activates ~3B** params per token and supports **ultra‑long context (≈262K)**. Only a subset of experts is active at a time, but the full weights must still be resident in GPU/CPU memory for fast inference.
+
+ ---
+
+ ## What’s special about **Qwen3‑Next** (A3B Instruct)
+
+ - **Hybrid MoE / A3B**: 80B total params with ~**3B activated** at inference; experts are sparsely selected per token.
+ - **Experts**: hundreds of experts, with a small **top‑k** subset activated per token; includes a shared expert for stability.
+ - **Context length**: native **262,144 tokens** (longer contexts are possible in some frameworks; see the parent model card).
+ - **Instruction‑tuned** variant (this repo): optimized for stable, formatted chat responses (no “thinking” traces).
+
+ > See the parent model card and official posts for detailed specs and benchmarks.
+
+ ---
+
+ ## Revisions & Branches
+
+ > The **`main`** branch is a **landing page** (model card + links). All runnable artifacts live under per‑revision branches.
+
+ - **main** — placeholder / landing page
+ - **W4A16-ASYM** — 4‑bit weights / 16‑bit activations builds and runtime assets
+ - **W8A16** — 8‑bit weights / 16‑bit activations builds
+
+ **Quick links:**
+ - 🔗 **[`main`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/main)**
+ - 🔗 **[`W4A16-ASYM`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/W4A16-ASYM)**
+ - 🔗 **[`W8A16`](https://huggingface.co/TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors/tree/W8A16)**
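
Because `main` carries no weights, anything that launches vLLM has to name a branch explicitly. The following is a minimal sketch, assuming vLLM's `--revision` flag is used to select the branch; the helper `serve_command` and its defaults are illustrative names, not part of this repository.

```python
import shlex

REPO = "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors"
# Branches listed above that hold runnable weights ("main" is only a landing page).
WEIGHT_BRANCHES = ("W4A16-ASYM", "W8A16")


def serve_command(branch: str, tensor_parallel_size: int = 8) -> list[str]:
    """Build a `vllm serve` argument list pinned to one quantized branch."""
    if branch not in WEIGHT_BRANCHES:
        raise ValueError(f"{branch!r} has no weights; choose one of {WEIGHT_BRANCHES}")
    return [
        "vllm", "serve", REPO,
        "--revision", branch,
        "--quantization", "compressed-tensors",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for the 8-bit branch.
    print(shlex.join(serve_command("W8A16")))
```

Rejecting `main` up front gives a clearer error than letting the server start against a branch that has no weight shards.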
+
+ ---
+
+ ## Repository Contents (per revision)
+
+ - **Sharded quantized weights** in `.safetensors` with an index (`model.safetensors.index.json`)
+ - `config.json` including **compressed‑tensors** metadata (`weight_format`, `quantization`, `quantization_config`)
+ - Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, etc.)
+ - Optional: `chat_template.jinja` (inherits the parent finetune’s chat format)
+
+ > Exact files can differ by branch; see the **Files and versions** tab for each revision.
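
To check what a given branch actually contains, the `quantization_config` block of its `config.json` can be summarized in a few lines. This is only a sketch: the sample dictionary is a hand-written illustration of the usual compressed-tensors layout (`config_groups`, `weights`, `ignore`), not a copy of this repository's file, and `summarize_quantization` is a hypothetical helper.

```python
import json

# Illustrative config only; real values must be read from the branch's config.json.
SAMPLE_CONFIG = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "ignore": ["lm_head"],
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "type": "int", "symmetric": false,
                    "strategy": "group", "group_size": 128}
      }
    }
  }
}
""")


def summarize_quantization(config: dict) -> list[dict]:
    """Return one summary row per quantization group found in the config."""
    qcfg = config.get("quantization_config", {})
    rows = []
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        rows.append({
            "group": name,
            "bits": weights.get("num_bits"),
            "symmetric": weights.get("symmetric"),
            "group_size": weights.get("group_size"),
            "ignored": list(qcfg.get("ignore", [])),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_quantization(SAMPLE_CONFIG):
        print(row)
```

For a downloaded branch, load its `config.json` with `json.load` and pass the resulting dictionary in place of `SAMPLE_CONFIG`.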
+
+ ---
+
+ ## Quantization recipe & **Qwen3‑Next nuances** (what this export does)
+
+ These builds were created with an **AWQ W4A16** / **W8A16** recipe using `llmcompressor` and a small **WikiText** calibration set. Important choices tailored to **Qwen3‑Next A3B**:
+
+ - **Calibration data**: `wikitext-2-raw-v1` **validation** split; **64** samples, tokenized with the **chat template**; sequence length **1024**.
+ - **Format & group size**: **weight‑only INT4** (INT8 on the W8A16 branch) with **group_size=128**; activations stay in the runtime dtype (A16). Non‑power‑of‑two channel counts are handled.
+ - **FFN policy**: FFN projections (`gate_proj`, `up_proj`, `down_proj`) are **not** on the ignore list, so they **are quantized**.
+ - **MoE routing kept full‑precision**: router/dispatcher linears left **unquantized** (e.g., names including `router`, `expert_choice`, `dispatch`, `scores`, `route`, `topk`, `switch`) for stable expert selection.
+ - **Head left unquantized**: `lm_head` remains in higher precision.
+ - **MoE‑aware calibration**: `calibrate_moe_context=True` to properly calibrate sparse‑expert activations.
+ - **Symmetry**: this recipe describes the W4A16 build as using **symmetric** INT4 weights, which does not match the `W4A16-ASYM` branch name; check the `symmetric` field under `quantization_config` in that branch's `config.json` to confirm. Activations are **BF16/FP16** at inference (A16).
+ - **Save**: exported with `save_compressed=True` to write **compressed‑tensors** metadata for vLLM.
+
+ > These design choices aim to preserve **router stability** and **FFN fidelity** in the A3B hybrid‑MoE layout while offering strong memory savings.
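
The terms `group_size` and symmetric versus asymmetric are easier to follow with a toy example. The sketch below is plain Python and shows group‑wise integer quantization in general. It is not the AWQ algorithm (which also rescales weights using activation statistics) and not `llmcompressor`'s implementation; all function names are illustrative.

```python
def quantize_group(values, num_bits=4, symmetric=True):
    """Quantize one group of floats to integers; return (codes, scale, zero_point)."""
    if symmetric:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8..7 for INT4
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        zero_point = 0
    else:
        qmin, qmax = 0, 2 ** num_bits - 1  # 0..15 for INT4
        # The representable range is widened to include 0.0.
        lo, hi = min(min(values), 0.0), max(max(values), 0.0)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = round(-lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def max_error(row, group_size=128, **kwargs):
    """Largest absolute reconstruction error when a row is quantized group by group."""
    worst = 0.0
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        restored = dequantize_group(*quantize_group(group, **kwargs))
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


if __name__ == "__main__":
    # A row whose values are all non-negative, i.e. skewed away from zero.
    skewed_row = [i / 10 for i in range(16)] * 16
    print("symmetric :", max_error(skewed_row, symmetric=True))
    print("asymmetric:", max_error(skewed_row, symmetric=False))
```

Each group of 128 weights gets its own scale (and, in the asymmetric case, its own zero point), so smaller groups track local weight ranges more closely at the cost of extra metadata. On skewed data the asymmetric variant uses the full integer range, which is why its error is lower in this toy run.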
+
+ ---
+
+ ## Quickstart — vLLM (compressed‑tensors)
+
+ Install vLLM (recent version recommended):
+
+ ```bash
+ pip install vllm
+ ```
+
+ Serve a quantized branch (adjust to your hardware). The `main` branch is only a landing page, so pass `--revision` to select a branch that contains weights:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors \
+   --revision W4A16-ASYM \
+   --quantization compressed-tensors \
+   --tensor-parallel-size 8 \
+   --max-model-len 262144 \
+   --gpu-memory-utilization 0.70 \
+   --dtype bfloat16
+ ```
+
+ Query via **Chat Completions**:
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+   "model": "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors",
+   "messages": [
+     {"role":"system","content":"You are Qwen3-Next (A3B), helpful, precise, and safe."},
+     {"role":"user","content":"Outline a retrieval pipeline for scientific PDFs."}
+   ],
+   "max_tokens": 512,
+   "temperature": 0.7,
+   "top_p": 0.95
+ }'
+ ```
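
The same request can be issued from Python with only the standard library. This is a sketch that mirrors the `curl` call above; it assumes the server is listening on `localhost:8000`, and `build_chat_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "TheHouseOfTheDude/Qwen3-Next-80B-A3B-Instruct_Compressed-Tensors"


def build_chat_request(user_prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Chat Completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are Qwen3-Next (A3B), helpful, precise, and safe."},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Outline a retrieval pipeline for scientific PDFs.")` needs the server from the previous step to be running; `build_chat_request` can be inspected without it.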
+
+ > **Note:** `compressed-tensors` is a **vLLM runtime format**. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. For Transformers, use a different export (e.g., GPTQ/AWQ compatible) or full‑precision weights.
+
+ ---
+
+ ## Prompting / Chat Template
+
+ This package follows the parent finetune’s **chat** conventions. If a `chat_template.jinja` is present, `apply_chat_template` will use it automatically.
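
For orientation, Qwen chat models use a ChatML‑style layout. The sketch below is a hand‑written approximation of that layout, for illustration only; the authoritative format is whatever the branch's chat template defines, and `render_chatml` is a hypothetical helper, not a library function.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt from a list of {"role", "content"} dicts."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this repository in one sentence."},
    ]))
```

When serving through vLLM's Chat Completions endpoint the template is applied on the server side, so manual rendering like this is mainly useful for debugging prompts.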
+
+ ---
+
+ ## Lineage
+
+ - **Finetuned parent:** [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
+ - **This repo:** **Quantized child** of the finetune (**compressed‑tensors** for vLLM)
+
+ ---
+
+ ## Hardware & Tips (rule‑of‑thumb)
+
+ - 80B‑class MoE with A3B still requires housing **all 80B weights** in GPU/CPU memory, though only ~**3B** are active per token.
+ - Long contexts are **KV‑cache** heavy—tune `--max-model-len` and batch size.
+ - Prefer **BF16** on GPUs with native support; otherwise **FP16**.
+ - Consider CUDA Graphs if stable in your stack.
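
A rough weight‑memory estimate follows from parameters × bits per weight. The sketch below rests on stated assumptions: exactly 80e9 parameters, every weight quantized (the router and `lm_head` are in fact kept in higher precision, so real sizes are somewhat larger), one 16‑bit scale per group, and group size 128 for both branches (the source states it only for INT4). KV cache and activations are not included.

```python
def weight_memory_gb(num_params, weight_bits, group_size=None,
                     scale_bits=16, zero_point_bits=0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = float(weight_bits)
    if group_size:
        # Per-group metadata is amortized over the weights in the group.
        bits_per_weight += (scale_bits + zero_point_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    total_params = 80e9  # assumption: nominal parameter count
    cases = [
        ("BF16 (unquantized)", dict(weight_bits=16)),
        ("W8A16, group 128", dict(weight_bits=8, group_size=128)),
        ("W4A16, group 128", dict(weight_bits=4, group_size=128)),
    ]
    for label, kwargs in cases:
        gb = weight_memory_gb(total_params, **kwargs)
        print(f"{label:>20}: {gb:7.2f} GB total, {gb / 8:6.2f} GB per GPU at TP=8")
```

Whatever remains of each GPU's memory budget after the weights is what vLLM can use for KV cache, which is why `--max-model-len` and `--gpu-memory-utilization` usually need tuning together.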
+
+ ---
+
+ ## License & Usage
+
+ This distribution inherits the licenses/policies of the **finetuned parent** model (Apache‑2.0).
+ Use of the model constitutes acceptance of the upstream terms.
+
+ ---
+
+ ## Changelog
+
+ - **v1 (current)** — Quantized compressed‑tensors exports for Qwen3‑Next‑80B‑A3B‑Instruct; added **W4A16‑ASYM** and **W8A16** branches; model card set for **Quantized** classification.