---
base_model: Qwen/Qwen3.5-122B-A10B
base_model_relation: quantized
language:
- en
- multilingual
license: apache-2.0
library_name: vllm
pipeline_tag: image-text-to-text
tags:
- qwen3.5
- moe
- vision-language
- multimodal
- deltanet
- quantized
- mixed-precision
- nvfp4
- mxfp8
- compressed-tensors
- prismaquant
- mtp
- speculative-decoding
- vllm
---

# Qwen3.5-122B-A10B — PrismaQuant 4.76 bpp

[![PrismaQuant source](https://img.shields.io/badge/PrismaQuant-GitHub-blue?logo=github)](https://github.com/RobTand/prismaquant)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](https://huggingface.co/Qwen/Qwen3.5-122B-A10B/blob/main/LICENSE)
[![vLLM native](https://img.shields.io/badge/vLLM-compressed--tensors-orange)](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

Mixed-precision quantization of `Qwen/Qwen3.5-122B-A10B` produced by [**PrismaQuant**](https://github.com/RobTand/prismaquant) — a per-Linear sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget.

**Why "every layer refracts into a different format":** a naive uniform policy either leaves disk space on the table (keeping everything in BF16 "to be safe") or loses quality (quantizing sensitive layers to 4-bit). PrismaQuant measures the actual Fisher-weighted MSE for every (Linear, format) pair and runs a multi-choice knapsack under a total-bit budget, so every bit lives where it buys the most likelihood.
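
The allocation idea described above can be illustrated with a small multiple-choice knapsack. This is a minimal sketch, not PrismaQuant's allocator: the function name `allocate`, the layer names, and the predicted-loss numbers are made up for illustration. The bits-per-parameter values are an assumption derived from the group-scale layouts listed under "Precision mix" (4 + 8/16 for NVFP4, 8 + 8/32 for MXFP8), ignoring the per-tensor global scale.

```python
from typing import Dict, List, Tuple

# Assumed storage cost per parameter, folding the per-group scale into the average.
BITS_PER_PARAM: Dict[str, float] = {"NVFP4": 4.5, "MXFP8": 8.25, "BF16": 16.0}

Layer = Tuple[str, int, Dict[str, float]]  # (name, n_params, {format: predicted loss increase})


def allocate(layers: List[Layer], budget_bits: int) -> Tuple[float, List[Tuple[str, str]]]:
    """Pick exactly one format per layer, minimizing total predicted loss under a bit budget."""
    # best[used_bits] = (total_cost, picks) over the layers processed so far
    best: Dict[int, Tuple[float, List[Tuple[str, str]]]] = {0: (0.0, [])}
    for name, n_params, costs in layers:
        nxt: Dict[int, Tuple[float, List[Tuple[str, str]]]] = {}
        for used, (total, picks) in best.items():
            for fmt, cost in costs.items():
                bits = used + round(n_params * BITS_PER_PARAM[fmt])
                if bits > budget_bits:
                    continue  # this choice would exceed the budget
                cand = (total + cost, picks + [(name, fmt)])
                if bits not in nxt or cand[0] < nxt[bits][0]:
                    nxt[bits] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda entry: entry[0])


# Toy input: three Linears with hypothetical sensitivities.
layers: List[Layer] = [
    ("attn.q_proj", 1000, {"NVFP4": 0.90, "MXFP8": 0.10, "BF16": 0.0}),
    ("mlp.experts", 8000, {"NVFP4": 0.20, "MXFP8": 0.05, "BF16": 0.0}),
    ("attn.o_proj", 1000, {"NVFP4": 0.30, "MXFP8": 0.04, "BF16": 0.0}),
]
total_cost, picks = allocate(layers, budget_bits=50_000)  # 5.0 bits/param on 10,000 params
print(picks)
```

In this toy case the large expert block is forced to NVFP4 by the budget, and the leftover bits go to the most sensitive small Linear (`attn.q_proj` at MXFP8), which is the behavior the paragraph describes.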

---

## At a glance

| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 244 GB | **72 GB** | **−70 %** |
| Fraction of original weights | 100 % | **29.5 %** | |
| Average bits per param | 16 | **4.76** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding heads | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |

---

## Precision mix

This checkpoint uses **three precisions**, selected per-Linear by the allocator from measured sensitivity — not chosen uniformly:

| Format | Weights | Activations | Use | Count |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk MoE experts + medium-sensitivity dense Linears + most visual-encoder Linears | **171 Linears + 1 MTP dense + 98 per-expert** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **12 Linears** |
| **BF16** | 16-bit | 16-bit | Norms, biases, embed / lm_head, 3 visual Linears the probe didn't tap | **407 entries** |

The allocator couples MoE `gate_up_proj` / `down_proj` so siblings share a scheme (vLLM's FusedMoE requires this), and fused attention siblings (`q_proj`/`k_proj`/`v_proj`) share one per-tensor global scale so the packed `qkv_proj` loads without the "accuracy mismatch" warning.

### Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

1. **GPTQ-OBS one-shot rounding** — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative.
2. **Activation-weighted rounding polish** — per-weight `argmin_{up,down}(Δw² · E[a²])` picks the grid neighbor that minimizes output-space error under the actual input distribution.

AWQ's γ-fold is **not** applied on this artifact.
On NVFP4's 16-channel groups, AWQ's per-channel rescaling pushes mixed-scale values into the same group and inflates per-group quant noise rather than reducing it (measured +3.3× PPL on Qwen3.6-35B in isolation). GPTQ + activation-weighted rounding are group-aware and measurably help: baseline PPL 4.97 → 4.86 (−2.2 %) on a held-out corpus with the 35B sibling.

---

## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MoE, 48 layers)

- **Full attention** Linears (`q_proj`, `k_proj`, `v_proj`, `o_proj`): mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
- **DeltaNet linear-attention** Linears (`in_proj_qkv`, `in_proj_z`, `in_proj_a`, `in_proj_b`, `out_proj`): same
- **MoE experts** (`gate_up_proj`, `down_proj`, 64 experts per MoE layer): per-expert NVFP4 with joint per-tensor scale across the `gate_up` pair so vLLM FusedMoE loads them
- **Shared expert** MLP: same per-Linear policy
- **Router** (`mlp.gate`): always BF16 (tiny, sensitive)

### Multi-token-prediction (MTP) head

- Speculative-decoding head (1 layer) + its own MoE block: same per-Linear policy, so `--speculative-config method=mtp` drafts at the same precision as the body.

### Visual encoder (27 blocks — Qwen3.5-VL vision tower)

- **Fisher-driven per-Linear allocation:** 108 of 111 visual Linears got placed by the full DP allocator on the basis of per-Linear activation-weighted cost (8 multimodal calibration samples, 110 Linears tracked via the `model.visual.*` module tree). The remaining 3 visual Linears (position-embed internals the probe didn't tap) default to BF16 to be safe.
- This is **the same treatment body Linears get**. There is ONE incremental code path: the streaming multimodal probe keeps the visual tower (~2 GB) resident on GPU while it streams the 244 GB body layer-by-layer, capturing Fisher through `inputs_embeds.backward(grad)` that propagates into visual weights.
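
The Fisher signal that drives the allocation above can be illustrated on a toy model. The sketch below is not PrismaQuant's streaming probe: it assumes a single linear map with a unit-variance Gaussian likelihood, and the helper names `fisher_diag` and `fisher_weighted_error` are hypothetical. It only shows the general pattern of averaging squared per-sample gradients and using them to weight a quantization error.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input vector x, target t)


def fisher_diag(weights: Sequence[float], samples: Sequence[Sample]) -> List[float]:
    """Empirical Fisher diagonal for y = w . x with NLL = 0.5 * (y - t)^2."""
    diag = [0.0] * len(weights)
    for x, target in samples:
        pred = sum(w * xi for w, xi in zip(weights, x))
        residual = pred - target
        for j, xi in enumerate(x):
            grad = residual * xi      # d NLL / d w_j for this sample
            diag[j] += grad * grad    # accumulate the squared gradient
    return [d / len(samples) for d in diag]


def fisher_weighted_error(diag: Sequence[float],
                          weights: Sequence[float],
                          quantized: Sequence[float]) -> float:
    """Sum of Fisher-weighted squared weight errors, a proxy for the loss increase."""
    return sum(f * (w - q) ** 2 for f, w, q in zip(diag, weights, quantized))


weights = [1.0, 0.0]
samples = [([1.0, 2.0], 0.0), ([1.0, 0.0], 2.0)]
diag = fisher_diag(weights, samples)
print(diag)  # [1.0, 2.0]
# The same rounding error costs twice as much on the second weight as on the first.
print(fisher_weighted_error(diag, weights, [1.5, 0.0]))  # 0.25
print(fisher_weighted_error(diag, weights, [1.0, 0.5]))  # 0.5
```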

### Passthrough (unquantized)

- `lm_head` (explicit ignore; quantizing it can regress the output head and the savings are trivial)
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`

---

## Serving (vLLM only)

This artifact is **only** runnable via vLLM's stock `compressed-tensors` support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is required.

```bash
vllm serve rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

- **FlashInfer** NVFP4 attention is picked up automatically; set `VLLM_USE_FLASHINFER_NVFP4=1` to make the preference explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for this family on DGX Spark (n=2 leaves ~10 % tok/s on the table, n=4 regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat API — no special flags.

---

## Reproducing this artifact

Full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

1. **Sensitivity probe** — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Each shard holds only its ~2 layers resident; the rest of the model is on disk or meta. 8 multimodal calibration samples drive visual Fisher through one unified streaming context.
2. **Per-(Linear, format) cost measurement** — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations. Incremental: same per-shard streaming as the probe.
3. **Multi-choice knapsack allocator** — picks one format per Linear minimizing total predicted Δloss under the bit budget. Target 4.75 bpp; achieved 4.76 bpp here.
4. **Export** — streams each body / visual / MTP shard, applies GPTQ + activation-weighted rounding to its NVFP4 entries, writes the compressed-tensors format.
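
Step 2 above (per-group round-to-nearest error weighted by input activations) can be sketched as follows. This is a simplified illustration rather than the repo's implementation: the helper names `rtn_group` and `weighted_rtn_cost` are hypothetical, the group scale is kept in full precision instead of being stored as FP8, and the per-tensor global scale is omitted. The grid is the set of magnitudes representable in FP4 (E2M1).

```python
import math
from typing import List, Sequence

# Magnitudes representable in FP4 E2M1; the sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def rtn_group(weights: Sequence[float]) -> List[float]:
    """Round one group to the FP4 grid using a single shared scale."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    scale = amax / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for w in weights:
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(mag * scale, w))
    return out


def weighted_rtn_cost(row: Sequence[float],
                      act_sq_mean: Sequence[float],
                      group_size: int = 16) -> float:
    """Sum over input channels j of (w_j - q_j)^2 * E[a_j^2] for one weight row."""
    cost = 0.0
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        quant = rtn_group(group)
        for j, (w, q) in enumerate(zip(group, quant)):
            cost += (w - q) ** 2 * act_sq_mean[start + j]
    return cost


row = [6.0, 2.2] + [0.0] * 14           # 2.2 rounds down to 2.0
quiet = [1.0] * 16                      # uniform activation energy
loud = [1.0, 10.0] + [1.0] * 14         # channel 1 carries 10x the energy
print(weighted_rtn_cost(row, quiet) < weighted_rtn_cost(row, loud))  # True
```

The same rounding error is charged more when it sits on a channel with high activation energy, which is why a format that looks acceptable by plain MSE can still be rejected for a given Linear.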

Wall-clock on a DGX Spark (128 GB unified memory): ~1 hour on cached probe + cost + activation shards (body shards are invariant across export-pass flag changes, so only the final export stage reruns when you change a flag).

---

## Known issues / limitations

- **vLLM only at serve time.** No transformers-runtime path for this precision mix today.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you verify against the draft-head acceptance-rate trace.
- **Visual outlier Linears** (3 total: position-embed internals) remain BF16 because the calibration loader didn't tap them. Savings are ~200 MB on the 72 GB total, not worth forcing.
- **Peak VRAM residency** on DGX Spark (unified memory) is ~86 GB with FP8 KV cache at 32 k context; tune `--gpu-memory-utilization` and `--max-model-len` if the machine is shared.

---

## Links

- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
- **Sibling 35B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)

## Citation

```bibtex
@software{prismaquant2026,
  title  = {PrismaQuant: per-Linear sensitivity-driven mixed-precision quantization for LLMs},
  author = {Tand, Rob},
  year   = 2026,
  url    = {https://github.com/RobTand/prismaquant},
}
```