---
base_model: Qwen/Qwen3.5-122B-A10B
base_model_relation: quantized
language:
  - en
  - multilingual
license: apache-2.0
library_name: vllm
pipeline_tag: image-text-to-text
tags:
  - qwen3.5
  - moe
  - vision-language
  - multimodal
  - deltanet
  - quantized
  - mixed-precision
  - nvfp4
  - mxfp8
  - compressed-tensors
  - prismaquant
  - mtp
  - speculative-decoding
  - vllm
---

# Qwen3.5-122B-A10B — PrismaQuant 4.76 bpp

[![PrismaQuant source](https://img.shields.io/badge/PrismaQuant-GitHub-blue?logo=github)](https://github.com/RobTand/prismaquant)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](https://huggingface.co/Qwen/Qwen3.5-122B-A10B/blob/main/LICENSE)
[![vLLM native](https://img.shields.io/badge/vLLM-compressed--tensors-orange)](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

Mixed-precision quantization of `Qwen/Qwen3.5-122B-A10B` produced by
[**PrismaQuant**](https://github.com/RobTand/prismaquant) — a per-Linear
sensitivity-driven allocator that chooses each Linear module's format
individually under a total-bit budget.

**Why "every layer refracts into a different format":** a naive uniform
NVFP4 either leaves disk on the table (keeping everything BF16 "to be
safe") or loses quality (quantizing sensitive layers to 4-bit).
PrismaQuant measures the actual Fisher-weighted MSE for every (Linear,
format) pair and runs a multi-choice knapsack under a total-bit budget,
so every bit lives where it buys the most likelihood.

---

## At a glance

| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 244 GB | **72 GB** | **−70 %** |
| Fraction of original weights | 100 % | **29.5 %** | |
| Average bits per param | 16 | **4.76** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding heads | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |

---

## Precision mix

This checkpoint uses **three precisions**, selected per-Linear by the
allocator from measured sensitivity — not chosen uniformly:

| Format | W | A | Use | Count |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk MoE experts + medium-sensitivity dense Linears + full visual encoder | **72 dense + 96 per-expert + 2 MTP per-expert + visual NVFP4s = 170+** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **12 Linears** |
| **BF16** | 16-bit | 16-bit | Router, norms, biases, embed / lm_head, pos_embed, layer passthrough | **404 entries** |

The allocator couples MoE `gate_up_proj` / `down_proj` so siblings share
a scheme (vLLM's FusedMoE requires this), and fused attention siblings
(`q_proj`/`k_proj`/`v_proj`) share one per-tensor global scale so the
packed `qkv_proj` loads without the "accuracy mismatch" warning.

### Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

1. **GPTQ-OBS one-shot rounding** — block-wise error propagation along
   the group-quant structure using the calibration Hessian. Closed-form,
   not iterative. Handles cross-column activation coupling.
2. **Closed-form per-group scale sweep** — for each 16-weight NVFP4
   group, enumerate `grid=32` candidate scales spanning
   `[0.5·s₀, 1.5·s₀]`, round each weight to its nearest codebook
   neighbor at every candidate scale, pick the (scale, rounding-set)
   configuration minimizing activation-weighted per-group MSE
   `sum_j a_j² · (w_orig,j - w_q,j)²`. Improve-or-keep gate against
   the post-GPTQ weight. Row-chunked to keep peak memory <2 GB
   regardless of Linear shape.

Scale_sweep is the **closed-form analog of Intel's AutoRound** — where
AutoRound learns per-weight continuous rounding offsets V via 200 SGD
iterations on a relaxed loss, scale_sweep enumerates the discrete scale
dimension directly and lets RTN pick rounding conditional on scale.
No gradient descent, sub-second per Linear after the row-chunked fix.

**Measured per-Linear output-MSE vs RTN baseline (Qwen3.6-35B, mixed
visual + MTP Linears, geomean — 122B shape class is similar):**

| Pipeline variant | out_mse ratio vs RTN |
|---|---:|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + act_round polish (**prior pipeline**) | **0.99** (act_round undid GPTQ) |
| scale_sweep only | 0.33 |
| **GPTQ + scale_sweep (this artifact)** | **0.33** |

The prior pipeline's `act_round` polish turned out to systematically
undo GPTQ's cross-column error propagation — its per-weight metric
minima don't respect GPTQ's compensation structure. scale_sweep
replaces it as a strict improvement.

AWQ's γ-fold is **not** applied. On NVFP4's 16-channel groups,
AWQ's per-channel rescaling pushes mixed-scale values into the
same group and inflates per-group quant noise rather than reducing it.

---

## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MoE, 48 layers)

- **Full attention** Linears (`q_proj`, `k_proj`, `v_proj`, `o_proj`):
  mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
- **DeltaNet linear-attention** Linears (`in_proj_qkv`, `in_proj_z`,
  `in_proj_a`, `in_proj_b`, `out_proj`): same
- **MoE experts** (`gate_up_proj`, `down_proj`, 64 experts per MoE
  layer): per-expert NVFP4 with joint per-tensor scale across the
  `gate_up` pair so vLLM FusedMoE loads them
- **Shared expert** MLP: same per-Linear policy
- **Router** (`mlp.gate`): always BF16 (tiny, sensitive)

### Multi-token-prediction (MTP) head

- Speculative-decoding head (1 layer) + its own MoE block: same
  per-Linear policy, so `--speculative-config method=mtp` drafts at the
  same precision as the body.

### Visual encoder (27 blocks — Qwen3.5-VL vision tower)

- **Fisher-driven per-Linear allocation:** 108 of 110 visual Linears got
  placed by the full DP allocator on the basis of per-Linear
  activation-weighted cost (8 multimodal calibration samples, 110
  Linears tracked via the `model.visual.*` module tree).
- **Remaining 2 un-probed visual Linears** (`patch_embed.proj` edges
  the probe didn't tap) stamped at NVFP4 uniformly.
- **`model.visual.pos_embed`** stays BF16 — it's a learnable parameter,
  not an `nn.Linear`, and vLLM's compressed-tensors loader cannot
  consume a quantized Parameter layout. The allocator's discover pass
  excludes it explicitly.
- This is **the same treatment body Linears get**. There is ONE
  incremental code path: the streaming multimodal probe keeps the visual
  tower (~2 GB) resident on GPU while it streams the 244 GB body
  layer-by-layer, capturing Fisher through `inputs_embeds.backward(grad)`
  that propagates into visual weights.

### Passthrough (unquantized)

- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only
  accepts a single `weight` parameter. The allocator measures lm_head's
  Fisher sensitivity and would pick NVFP4 for it (saving ~1.1 GB), but
  the compressed-tensors runtime rejects a compressed lm_head with
  `KeyError: lm_head.input_global_scale` because its scheme registry
  doesn't include `ParallelLMHead`. **This is a vLLM runtime limitation,
  not a PrismaQuant design decision.**
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed` (Parameter/Embedding, see above)

---

## Serving (vLLM only)

This artifact is **only** runnable via vLLM's stock `compressed-tensors`
support — there is no transformers-native runtime path for mixed NVFP4 +
MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is
required.

```bash
vllm serve rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

- **FlashInfer** NVFP4 attention is picked up automatically; set
  `VLLM_USE_FLASHINFER_NVFP4=1` to make the preference explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for
  this family on DGX Spark (n=2 leaves ~10 % tok/s on the table, n=4
  regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat
  API — no special flags.

---

## Reproducing this artifact

Full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

1. **Sensitivity probe** — streaming per-shard empirical-Fisher trace
   (diagonal) across body + MTP + visual Linears. Each shard holds only
   its ~2 layers resident; the rest of the model is on disk or meta. 8
   multimodal calibration samples drive visual Fisher through one
   unified streaming context.
2. **Per-(Linear, format) cost measurement** — for each Linear and each
   candidate format, the per-group RTN error weighted by cached input
   activations. Incremental: same per-shard streaming as the probe.
3. **Multi-choice knapsack allocator** — picks one format per Linear
   minimizing total predicted Δloss under the bit budget. Target 4.75
   bpp; achieved 4.758 bpp here. Known-non-Linear rank-2 tensors
   (`pos_embed`, `rotary_emb`) are excluded from the visual pool.
4. **Export** — streams each body / visual / MTP shard, applies GPTQ +
   activation-weighted rounding to its NVFP4 entries, writes the
   compressed-tensors format. `lm_head` passthrough at BF16 enforced
   at this stage (see known issues).

Wall-clock on a DGX Spark (128 GB unified memory): ~1 hour on cached
probe + cost + activation shards (body shards are invariant across
export-pass flag changes, so only the final export stage reruns when
you change a flag).

---

## Known issues / limitations

- **vLLM only at serve time.** No transformers-runtime path for this
  precision mix today.
- **lm_head stays BF16** because vLLM's `ParallelLMHead` does not
  register the NVFP4/MXFP8 compressed-tensors schemes. Allocator
  measured it and would have picked NVFP4; the runtime limitation
  forces BF16. Costs ~1.1 GB on the disk footprint.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you
  verify against the draft-head acceptance-rate trace.
- **Peak VRAM residency** on DGX Spark (unified memory) is ~86 GB with
  FP8 KV cache at 32 k context; tune `--gpu-memory-utilization` and
  `--max-model-len` if the machine is shared.

---

## Links

- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
- **Sibling 35B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)

## Citation

```bibtex
@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}
```