Qwen3.6-35B-A3B-NVFP4

Mixed-precision NVFP4 (+ optional FP8 lm_head) quantization of Qwen/Qwen3.6-35B-A3B — the Mixture-of-Experts variant with 35 B total parameters and ~3 B active per token — targeting native Blackwell (SM120) deployment, primarily RTX 5090 32 GB.

The original BF16 checkpoint needs ~67 GiB of VRAM. This build fits a single RTX 5090 32 GB at 32 K context with usable KV cache, multi-turn reasoning, hybrid attention (self-attn + GDN linear-attn), and tool-call support.

TL;DR

  • Use main + --kv-cache-dtype turboquant_4bit_nc for production workflow / tool-call / structured-output serving.
  • Use fp8-head + --kv-cache-dtype turboquant_4bit_nc when you want the highest concurrency and the workload is mostly free-form text.
  • Do not expect an fp8-head-embed branch here. FP8 embeddings were intentionally skipped after a regression canary on the 27 B sibling model.
  • The reported concurrency is measured on a single RTX 5090 32 GB with max_model_len=32768, max_num_seqs=16, and CUDA graphs enabled.

Variants

The repository hosts two branches:

| Branch | lm_head dtype | embed_tokens dtype | Intended deployment |
| --- | --- | --- | --- |
| main | BF16 | BF16 | Production default. Workflow / agent engine: preserves output-token precision for JSON, tool calls, enums. |
| fp8-head | FP8_BLOCK [128, 128] | BF16 | Free-form text and chat at higher concurrency. ~10 % more KV-cache headroom than main, at the cost of FP8-quantizing the output projection. |

Inner Linear layers (every MoE expert projection plus self-attn projection) use the same uniform NVFP4 calibration on both branches. The calibrated NVFP4 inner weights are bit-identical across the two branches; only the output projection differs.

VRAM and concurrency (RTX 5090 32 GB)

Measured on a single RTX 5090 with gpu_memory_utilization=0.93, max_model_len=32768, max_num_seqs=16, max_num_batched_tokens=4096, dtype=bfloat16.

| Branch | KV cache dtype | Weights | KV cache | GPU KV cache | Max concurrency @ 32 K | Decode tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| main (bf16head) | fp8_e4m3 | 21.88 GiB | 4.63 GiB | 119,472 tok | 12.16× | 182 |
| main (bf16head) | turboquant_4bit_nc | 21.96 GiB | 4.94 GiB | 249,856 tok | 22.45× | 175 |
| fp8-head (fp8lm) | fp8_e4m3 | 21.40 GiB | 5.10 GiB | 132,048 tok | 13.42× | 193 |
| fp8-head (fp8lm) | turboquant_4bit_nc | 21.48 GiB | 5.41 GiB | 274,432 tok | 24.55× | 184 |

TurboQuant 4-bit non-causal KV cache roughly doubles the KV pool versus fp8_e4m3 at the cost of ~4–5 % single-stream decode. The recommended deploy mode is turboquant_4bit_nc for both branches.
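
Both figures in the paragraph above can be recomputed from the table. The following is a minimal sketch with the table values copied in by hand; the dictionary layout and the function name are illustrative only and do not belong to any tool.

```python
# Values copied from the 32 K table: (KV pool in tokens, decode tok/s).
TABLE_32K = {
    ("main", "fp8_e4m3"): (119_472, 182),
    ("main", "turboquant_4bit_nc"): (249_856, 175),
    ("fp8-head", "fp8_e4m3"): (132_048, 193),
    ("fp8-head", "turboquant_4bit_nc"): (274_432, 184),
}


def turboquant_tradeoff(branch):
    """Return (KV pool growth factor, fractional decode slowdown) for a branch."""
    fp8_tokens, fp8_decode = TABLE_32K[(branch, "fp8_e4m3")]
    tq_tokens, tq_decode = TABLE_32K[(branch, "turboquant_4bit_nc")]
    return tq_tokens / fp8_tokens, 1 - tq_decode / fp8_decode


for branch in ("main", "fp8-head"):
    growth, slowdown = turboquant_tradeoff(branch)
    print(f"{branch}: KV pool x{growth:.2f}, decode -{slowdown:.1%}")
```

On these numbers the KV pool grows by about 2.1× on both branches, for a single-stream decode cost of roughly 3.8 % on main and 4.7 % on fp8-head.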

Scaling up to 262 K context (production config)

The numbers above are the conservative Gate-4 baseline (32 K context, max_num_seqs=16, gpu_memory_utilization=0.93). For production workloads the same artifact serves cleanly at the full Qwen3.6-35B-A3B context window (262 144 tokens) with gpu_memory_utilization=0.95 and max_num_seqs=64, which gives a substantially larger KV pool because the block-allocation strategy with longer contexts reduces partial-block waste:

| Branch | KV cache dtype | Weights | KV cache | GPU KV cache | Max concurrency @ 262 K |
| --- | --- | --- | --- | --- | --- |
| main (bf16head) | turboquant_4bit_nc | 21.96 GiB | 5.48 GiB | 1,022,361 tok | 3.90× |
| main (bf16head) | fp8_e4m3 | 21.88 GiB | 4.97 GiB | 492,512 tok | 1.88× |
| fp8-head (fp8lm) | turboquant_4bit_nc | 21.48 GiB | 5.48 GiB | ~1,022,000 tok | ~3.9× |

A single endpoint at max_model_len=262144 therefore serves both ordinary short requests (where max_num_seqs=64 is the binding ceiling) and very long-context requests (about 3.9× concurrent at full 262 K with turboquant_4bit_nc, 1.88× with fp8_e4m3) from the same engine, without needing a separate "long-context lane".
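
In the 262 K table, the "Max concurrency" column equals the KV pool size divided by max_model_len. A minimal sketch of that arithmetic follows; the function name is illustrative. The 32 K table's concurrency values are reported by the engine and are not reproduced by this simple division.

```python
def max_full_context_requests(kv_pool_tokens, max_model_len=262_144):
    """How many requests of exactly max_model_len tokens fit in the KV pool."""
    return kv_pool_tokens / max_model_len


print(round(max_full_context_requests(1_022_361), 2))  # 3.9
print(round(max_full_context_requests(492_512), 2))    # 1.88
```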

Branch selection

  • Workflow / agent engine, tool calls, structured output: use main. BF16 lm_head keeps logit ordering intact for sampling sensitive tokens (JSON braces, enum values, IDs, dates).
  • Free-form text generation, chat, summarization: use fp8-head for the extra concurrency headroom at the same context length.

Why no fp8-head-embed branch

The sibling inferRouter/Qwen3.6-27B-NVFP4 repository ships a third fp8-head-embed branch (FP8 lm_head and FP8 embed) marked lab only because a regression canary on a short Czech arithmetic prompt reproducibly flipped the model's final discount comparison while the intermediate calculation was still correct. The 35 B variant does not ship an embed_tokens-FP8 branch — the same risk applies and the saving on this architecture (~0.47 GiB) does not justify shipping a lab-only artifact.

Architecture / quantization summary

  • Backbone: Qwen3_5MoeForConditionalGeneration
  • 40 transformer layers; hybrid attention (16 self-attention layers interleaved with 24 GDN linear-attention / Mamba-style layers)
  • 256 routed experts per MoE block, top-K routed
  • NVFP4 (W4A4, group size 16) on every Linear not in the ignore list
  • BF16 stays on: visual-encoder blocks (vision tower retained for image-text-to-text), every layer's linear_attn.*, every layer's mlp.gate and mlp.shared_expert_gate, MTP graft, and (on main) lm_head + embed_tokens.
  • On fp8-head: lm_head is FP8_BLOCK [128, 128] (block-quantized float-8 with one scale per 128 × 128 weight block).
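
To make the [128, 128] block structure concrete, here is a minimal pure-Python sketch of how a block scale grid can be laid out. It assumes the common absmax convention (scale = largest magnitude in the block divided by 448, the largest finite float8 e4m3 value); the actual scale computation used for this checkpoint may differ. The function names and the toy matrix are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def scale_grid_shape(rows, cols, block=(128, 128)):
    """Number of scale entries along each axis for a rows x cols weight."""
    return (math.ceil(rows / block[0]), math.ceil(cols / block[1]))


def block_absmax_scales(weight, block=(128, 128)):
    """One scale per block: absmax of the block divided by FP8_E4M3_MAX.

    Assumption: absmax scaling. `weight` is a non-empty list of equal-length rows.
    """
    rows, cols = len(weight), len(weight[0])
    grid_rows, grid_cols = scale_grid_shape(rows, cols, block)
    absmax = [[0.0] * grid_cols for _ in range(grid_rows)]
    for r in range(rows):
        for c in range(cols):
            br, bc = r // block[0], c // block[1]
            absmax[br][bc] = max(absmax[br][bc], abs(weight[r][c]))
    return [[value / FP8_E4M3_MAX for value in row] for row in absmax]


print(scale_grid_shape(256, 384))  # (2, 3)
print(block_absmax_scales([[448.0, 1.0], [2.0, -896.0]], block=(2, 2)))  # [[2.0]]
```

A weight whose dimensions are not multiples of 128 still gets a scale for the partial block at the edge, which is why the grid shape uses a ceiling division.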

A frozen Multi-Token-Prediction head (model_mtp.safetensors, ~1.6 GB) is included for compatibility with vLLM speculative-decode setups; it is not loaded by default.

Recipe and calibration

This build follows the spirit of Red Hat / Neural Magic's LLM Compressor recipes for NVFP4 MoE checkpoints, with an adjusted calibration mix tuned for Czech-language robustness and Czech legal-domain coverage in addition to the usual English chat / math / code / multilingual diet. The raw calibration corpus is not redistributed with this model card; the public artifact records the important reproducibility metadata in the checkpoint config and branch layout.

Calibration ran one full pass at 1280 samples and 8192 max sequence length on the BF16 source checkpoint, producing the uniform NVFP4 inner used by both branches. The fp8-head branch keeps the calibrated FP8_BLOCK lm_head; the main branch surgically restores the original BF16 lm_head from the base checkpoint. This avoids a second 28-hour calibration run while preserving the same NVFP4 inner numerics.
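
The head restoration can be pictured as a merge of two tensor maps. The sketch below uses plain dictionaries in place of real checkpoints; the function name, the "lm_head." key prefix, and the example keys are assumptions for illustration, not the actual script or tensor names used for this build.

```python
def restore_bf16_head(quantized, base, prefix="lm_head."):
    """Return a copy of `quantized` whose `prefix` tensors come from `base`.

    Every quantized entry under `prefix` (weights and scale companions) is
    dropped, then the matching entries from the base checkpoint are copied in.
    """
    merged = {k: v for k, v in quantized.items() if not k.startswith(prefix)}
    merged.update({k: v for k, v in base.items() if k.startswith(prefix)})
    return merged


# Toy stand-ins for tensors; real checkpoints would hold arrays here.
quantized = {
    "model.layers.0.w": "nvfp4",
    "lm_head.weight": "fp8",
    "lm_head.weight_scale": "fp8-scale",  # hypothetical key name
}
base = {"model.layers.0.w": "bf16", "lm_head.weight": "bf16-head"}
print(restore_bf16_head(quantized, base))
```

Only the head entries change hands, so the NVFP4 inner tensors stay exactly as calibrated.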

Files

Per branch:

chat_template.jinja
config.json
generation_config.json
model.safetensors                (~22–23 GB; differs across branches by ~0.5 GB)
model.safetensors.index.json
model_mtp.safetensors            (~1.6 GB; identical across branches)
processor_config.json
recipe.yaml
tokenizer.json
tokenizer_config.json

The main branch additionally hosts a vllm_patches/ folder with the source overlay needed to dispatch FP8 weights on lm_head through the compressed-tensors path when serving the fp8-head branch on vLLM versions that have not yet merged the upstream PR.

Recommended vLLM serve config

Single RTX 5090 32 GB, full 262 K context, production-tuned, turboquant_4bit_nc KV cache, max 64 in-flight sequences:

vllm serve inferRouter/Qwen3.6-35B-A3B-NVFP4 \
  --revision main \
  --served-model-name qwen35b-a3b \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype turboquant_4bit_nc \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype bfloat16 \
  --trust-remote-code

For the fp8-head branch, change --revision to fp8-head and keep everything else the same.

If you do not need the full 262 K window and want a slightly safer VRAM headroom (extra ~0.5 GiB), drop --max-model-len to 32768 and --gpu-memory-utilization to 0.93 — that is the Gate-4 baseline whose numbers are in the upper VRAM table.

max_num_batched_tokens must be at least 4096 because the GDN linear-attn layers in this architecture impose a Mamba block-size constraint of 3072 tokens. The recommended value of 8192 keeps long-prompt TTFT reasonable without exhausting per-iteration GPU memory.
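
For scripted deployments, the command above can be assembled programmatically with the 4096-token lower bound checked up front. This is a minimal sketch: the function name and the constant are illustrative, and it only uses flags that already appear in the serve command above.

```python
import shlex

MIN_BATCHED_TOKENS = 4096  # lower bound stated in the text above


def build_serve_command(revision="main", kv_cache_dtype="turboquant_4bit_nc",
                        max_model_len=262144, gpu_memory_utilization=0.95,
                        max_num_seqs=64, max_num_batched_tokens=8192):
    """Assemble the `vllm serve` argument list from the recommended config."""
    if revision not in ("main", "fp8-head"):
        raise ValueError(f"unknown branch: {revision!r}")
    if max_num_batched_tokens < MIN_BATCHED_TOKENS:
        raise ValueError(
            f"max_num_batched_tokens must be >= {MIN_BATCHED_TOKENS}, "
            f"got {max_num_batched_tokens}")
    return [
        "vllm", "serve", "inferRouter/Qwen3.6-35B-A3B-NVFP4",
        "--revision", revision,
        "--served-model-name", "qwen35b-a3b",
        "--tensor-parallel-size", "1",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--kv-cache-dtype", kv_cache_dtype,
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
        "--dtype", "bfloat16",
        "--trust-remote-code",
    ]


# Shorter-context variant on the fp8-head branch with the fp8_e4m3 KV cache.
print(shlex.join(build_serve_command(
    revision="fp8-head", kv_cache_dtype="fp8_e4m3",
    max_model_len=32768, gpu_memory_utilization=0.93)))
```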

If your vLLM image does not include the upstream TurboQuant hybrid attention support yet, swap --kv-cache-dtype turboquant_4bit_nc for --kv-cache-dtype fp8_e4m3 and you will get the fp8_e4m3 baseline numbers from the tables above — both KV modes are tested and stable on both branches. Current vLLM main / nightly builds after 2026-05-05 should already include the TurboQuant hybrid patch; manual PR #39931 patching is mainly for pinned releases and older vendor images.

Measured throughput under sustained load

Numbers below come from a 13-minute sustained run on a single RTX 5090 32 GB with the production config above (262 K context, max_num_seqs=64, turboquant_4bit_nc, chunked prefill, ignore_eos=true so every request runs to max_tokens):

Production-shape workload (25 concurrent, 30 K input + 2 000 forced output, 5 minutes sustained):

| Metric | Value |
| --- | --- |
| Requests completed / errors | 100 / 0 |
| Wall p50 / p95 | 92.3 s / 108.9 s |
| TTFT p50 / p95 | 2.04 s / 19.8 s |
| Aggregate decode | 542 tok/s |
| Aggregate total (prefill + decode) | 7,005 tok/s |
| Per-request decode | 22.2 tok/s |

Saturation ramp (15 K input + 1 500 forced output, 60 s per step):

| Concurrent | Wall p50 | TTFT p50 | Aggregate decode |
| --- | --- | --- | --- |
| 30 | 50.7 s | 1.81 s | 865 tok/s (peak) |
| 40 | 63.9 s | 8.6 s | 810 tok/s |
| 50 | 71.7 s | 10.8 s | 793 tok/s |
| 60 | 75.8 s | 12.9 s | 839 tok/s |
| 64 | 77.1 s | 13.7 s | 854 tok/s |

The sweet spot is around 30 concurrent requests: aggregate decode peaks at ~865 tok/s with a TTFT p50 of 1.8 s. Above 30 concurrent, throughput plateaus at roughly 790–855 tok/s (the engine is decode-bound at that point) and TTFT p50 grows from 1.8 s to 13.7 s as the scheduler queue backs up. The container served 471 requests across the full 13-minute stress run without errors or degraded responses.
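
The plateau is easier to see per request. A minimal sketch, with the ramp values copied from the table by hand and illustrative function names, that finds the peak step and divides aggregate decode by the number of concurrent requests:

```python
# (concurrent requests, aggregate decode tok/s) from the saturation ramp.
RAMP = [(30, 865), (40, 810), (50, 793), (60, 839), (64, 854)]


def peak_step(ramp):
    """Return the (concurrency, aggregate tok/s) step with the highest throughput."""
    return max(ramp, key=lambda step: step[1])


def per_request_decode(ramp):
    """Average decode rate each request sees at every concurrency level."""
    return {concurrent: aggregate / concurrent for concurrent, aggregate in ramp}


print(peak_step(RAMP))
for concurrent, rate in per_request_decode(RAMP).items():
    print(f"{concurrent} concurrent: {rate:.2f} tok/s per request")
```

Because aggregate throughput is nearly flat above 30 concurrent, the per-request rate falls from about 28.8 tok/s at 30 concurrent to about 13.3 tok/s at 64.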

Very long single-request contexts (60 K, 120 K, 200 K tokens) do load and run in the same engine, but compete with the normal cohort for KV slots and prefill bandwidth. Under sustained 30-concurrent normal load, TTFT for a 60 K outlier was 73 s and a 200 K outlier did not complete in a 3-minute window. For batch workloads where outliers are rare (≤ 0.01 % of requests in our tests), either schedule them serially after the main cohort or route them through a dedicated low-priority worker; do not expect interactive latency on outliers while the engine is saturated by the normal cohort.
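
One way to implement the "schedule them serially after the main cohort" option is to split a batch by prompt length before submission. The sketch below is an assumption-laden illustration: the function name, the request dictionary shape, and the 60 000-token threshold (taken from the smallest outlier size mentioned above) are all hypothetical choices, not part of this repository.

```python
def split_cohort(requests, outlier_threshold_tokens=60_000):
    """Split requests into (normal, outliers) by prompt length.

    Outliers are returned shortest-first so they can be run one at a time
    after the normal cohort has drained.
    """
    normal, outliers = [], []
    for request in requests:
        if request["prompt_tokens"] >= outlier_threshold_tokens:
            outliers.append(request)
        else:
            normal.append(request)
    outliers.sort(key=lambda request: request["prompt_tokens"])
    return normal, outliers


requests = [
    {"id": "a", "prompt_tokens": 15_000},
    {"id": "b", "prompt_tokens": 200_000},
    {"id": "c", "prompt_tokens": 30_000},
    {"id": "d", "prompt_tokens": 60_000},
]
normal, outliers = split_cohort(requests)
print([r["id"] for r in normal])    # ['a', 'c']
print([r["id"] for r in outliers])  # ['d', 'b']
```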

Runtime requirements per branch

Base requirement for both branches: a recent vLLM build with compressed-tensors NVFP4 support for Blackwell GPUs and the Qwen3.6 hybrid architecture. The table below lists the extra patch requirements on top of that base runtime.

Three independent vLLM patches may be needed depending on which branch you serve and which KV cache dtype you choose:

| Patch | Source | When required |
| --- | --- | --- |
| TurboQuant hybrid attention | vLLM PR #39931 | Any branch, if --kv-cache-dtype turboquant_* is set. Qwen3.6-35B-A3B is a hybrid architecture (self-attn + GDN linear-attn); the in-tree TurboQuant rejects hybrid models before this PR. The PR was merged to vllm-project/vllm:main on 2026-05-05, so current vLLM main/nightly builds after that date should already include it. Stock releases cut before that date, including v0.20.x, still need either the PR applied or a newer nightly/base image. |
| TurboQuant continuation-prefill workspace fix | vLLM PR #40798 | Any branch, if --kv-cache-dtype turboquant_* is set and chunked prefill is enabled (default in vLLM V1). Without this patch, prompts longer than max_num_batched_tokens trigger a _continuation_prefill workspace size assertion at runtime (see upstream issues #41726, #41565, #40420). The patch reserves the maximum-shape continuation_prefill workspace before locking it during CUDA-graph capture. Still open upstream at the time of this write-up; until it is merged, apply it as a source overlay on the vLLM Python package in the serving image. The fp8_e4m3 KV cache mode uses a different attention backend and does not require this patch. |
| Compressed-tensors FP8 head dispatch | vllm_patches/ | fp8-head branch only. The in-tree compressed-tensors dispatcher in vLLM 0.20.x routes only LinearBase, ParallelLMHead, Attention, and FusedMoE modules through quant schemes; FP8 weight loading on lm_head needs the additional dispatch patch shipped in this repo. An upstream PR for this is in progress. |

Combined matrix:

| Branch | KV cache dtype | Needs PR #39931? | Needs PR #40798? | Needs vllm_patches/? |
| --- | --- | --- | --- | --- |
| main | fp8_e4m3 (or default) | no | no | no |
| main | turboquant_4bit_nc (recommended) | yes | yes | no |
| fp8-head | fp8_e4m3 | no | no | yes |
| fp8-head | turboquant_4bit_nc (recommended) | yes | yes | yes |

If your vLLM build or base image already includes PR #39931 — for example a vLLM main/nightly build from after 2026-05-05 — you only need PR #40798 (for turboquant_* KV modes) and, on the fp8-head branch, the vllm_patches/ overlay. The main branch with fp8_e4m3 KV cache runs out of the box on any recent vLLM build.
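
The matrix and the paragraph above reduce to three independent conditions. A minimal sketch that encodes them follows; the function name, parameter names, and the string labels in the returned set are illustrative.

```python
def required_patches(branch, kv_cache_dtype, has_pr_39931=False,
                     chunked_prefill=True):
    """Return the set of extra patches needed on top of the base runtime.

    has_pr_39931: True for builds that already include PR #39931.
    chunked_prefill: PR #40798 only matters when chunked prefill is enabled.
    """
    patches = set()
    if kv_cache_dtype.startswith("turboquant_"):
        if not has_pr_39931:
            patches.add("PR #39931")
        if chunked_prefill:
            patches.add("PR #40798")
    if branch == "fp8-head":
        patches.add("vllm_patches/")
    return patches


print(sorted(required_patches("main", "fp8_e4m3")))  # []
print(sorted(required_patches("fp8-head", "turboquant_4bit_nc")))
print(sorted(required_patches("fp8-head", "turboquant_4bit_nc", has_pr_39931=True)))
```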

What vllm_patches/ adds (fp8-head only)

The vllm_patches/ folder on the main branch contains:

  • apply_ct_fp8_lmhead_patch.py — idempotent source patcher. Adds the lm_head branch to CompressedTensorsConfig.get_quant_method, wires the FP8 weight + block-scale parameters into ParallelLMHead, and adjusts vocab_parallel_embedding.py so the FP8 scale companion loads cleanly. The patcher detects when upstream PR #41000-style wire-up is already present and skips the relevant patch points.
  • compressed_tensors_embedding.py — companion runtime module for the compressed-tensors FP8 embedding experiment. It is included because the patch bundle is shared with sibling checkpoints, but this 35 B repository does not ship an FP8-embedding branch.
  • Dockerfile.turboquant — minimal Docker overlay that applies the patcher on top of a vLLM image that already includes PR #39931 plus the upstream FP8 lm_head work (Red Hat / Neural Magic vLLM nightly). Builds in ~30 seconds.

Apply the patch (local dev)

# Clone a vLLM source tree. Stock v0.20.2 is fine for fp8_e4m3 KV-cache
# testing, but does NOT include TurboQuant hybrid attention support. For
# --kv-cache-dtype turboquant_* use vLLM main/nightly from 2026-05-05 or
# later, or a base image with PR #39931.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm.git vllm-src
cd vllm-src

# Place the patcher and companion in /tmp and run the patcher against
# your vLLM install path.
cp /path/to/vllm_patches/apply_ct_fp8_lmhead_patch.py /tmp/
cp /path/to/vllm_patches/compressed_tensors_embedding.py /tmp/

python3 /tmp/apply_ct_fp8_lmhead_patch.py /path/to/site-packages

python3 -m py_compile \
  /path/to/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py \
  /path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py \
  /path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_embedding.py

The patcher is idempotent and detects when PR #41000 has already wired quant_config into ParallelLMHead; in that case it skips the first patch point. For the compressed-tensors FP8 head dispatch itself, it works on both stock v0.20.2 and on Red Hat / Neural Magic vLLM nightlies that already include the upstream lm_head FP8 work. TurboQuant hybrid support is separate and still requires PR #39931 or a newer vLLM build that includes it.

Docker overlay (recommended)

# inside vllm_patches/
docker build -f Dockerfile.turboquant -t local/vllm-qwen36-35b-a3b:patched .

The base image referenced in Dockerfile.turboquant already includes the TurboQuant hybrid patch (PR #39931) and the upstream FP8 lm_head work; the overlay only adds the compressed-tensors dispatch.

Inference example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-anything")

resp = client.chat.completions.create(
    model="qwen35b-a3b",
    messages=[
        # Czech prompt: "Write me three sentences about the importance of bees for agriculture."
        {"role": "user", "content": "Napis mi tri vety o vyznamu vcel pro zemedelstvi."},
    ],
    max_tokens=200,
    temperature=0.7,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)

The Qwen3.6 chat template supports both enable_thinking=true and enable_thinking=false. For tests and short prompts prefer false — thinking mode emits a reasoning trace before the final answer and a low max_tokens cap will cut the trace mid-flight, returning content=null.
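
A client can detect the truncated-trace case instead of printing None. This is a minimal sketch: the helper name is illustrative, and it relies only on the message content and the finish_reason value ("length" when the max_tokens cap was hit) that the OpenAI-compatible response carries.

```python
def extract_answer(content, finish_reason):
    """Return the final answer text, or raise if no answer was produced."""
    if content is None:
        if finish_reason == "length":
            raise RuntimeError(
                "reply hit max_tokens before the final answer; "
                "raise max_tokens or set enable_thinking=False")
        raise RuntimeError(
            f"no content returned (finish_reason={finish_reason!r})")
    return content


print(extract_answer("Bees pollinate crops.", "stop"))
```

With the inference example above, it would be called as extract_answer(resp.choices[0].message.content, resp.choices[0].finish_reason).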

Quality notes

The 35 B fp8-head branch passed the same short smoke set used for the production branch, including the Czech discount canary that caught the FP8-embedding regression on the 27 B sibling model. That result should not be read as proof that FP8 output projection is universally harmless; it is why the recommended workflow-engine branch remains main with BF16 lm_head and BF16 embed_tokens.

For structured output, tool routing, JSON, enum selection, or long-running agent workflows, prefer main. For summarization, chat, and prose-heavy generation where a small output-projection quantization risk is acceptable, fp8-head gives about 9 % more full-32K concurrency with TurboQuant.

Credits

License

Apache-2.0, inherited from the base Qwen/Qwen3.6-35B-A3B model.
