Qwen3.6-35B-A3B-NVFP4
Mixed-precision NVFP4 (+ optional FP8 lm_head) quantization of Qwen/Qwen3.6-35B-A3B — the Mixture-of-Experts variant with 35 B total parameters and ~3 B active per token — targeting native Blackwell (SM120) deployment, primarily RTX 5090 32 GB.
The original BF16 checkpoint needs ~67 GiB of VRAM. This build fits a single RTX 5090 32 GB at 32 K context with usable KV cache, multi-turn reasoning, hybrid attention (self-attn + GDN linear-attn), and tool-call support.
TL;DR
- Use main + --kv-cache-dtype turboquant_4bit_nc for production workflow / tool-call / structured-output serving.
- Use fp8-head + --kv-cache-dtype turboquant_4bit_nc when you want the highest concurrency and the workload is mostly free-form text.
- Do not expect an fp8-head-embed branch here. FP8 embeddings were intentionally skipped after a regression canary on the 27 B sibling model.
- The reported concurrency is measured on a single RTX 5090 32 GB with max_model_len=32768, max_num_seqs=16, and CUDA graphs enabled.
Variants
The repository hosts two branches:
| Branch | lm_head dtype | embed_tokens dtype | Intended deployment |
|---|---|---|---|
| main | BF16 | BF16 | Production default. Workflow / agent engine — preserves output-token precision for JSON, tool calls, enums. |
| fp8-head | FP8_BLOCK [128, 128] | BF16 | Free-form text and chat at higher concurrency. ~10 % more KV-cache headroom than main at the cost of a small FP8 quantization of the output projection. |
Inner Linear layers (every MoE expert projection plus self-attn projection) use the same uniform NVFP4 calibration on both branches. The calibrated NVFP4 inner weights are bit-identical across the two branches; only the output projection differs.
VRAM and concurrency (RTX 5090 32 GB)
Measured on a single RTX 5090 with gpu_memory_utilization=0.93,
max_model_len=32768, max_num_seqs=16, max_num_batched_tokens=4096,
dtype=bfloat16.
| Branch | KV cache dtype | Weights | KV cache | GPU KV cache | Max concurrency @ 32 K | Decode tok/s |
|---|---|---|---|---|---|---|
| main (bf16head) | fp8_e4m3 | 21.88 GiB | 4.63 GiB | 119,472 tok | 12.16× | 182 |
| main (bf16head) | turboquant_4bit_nc | 21.96 GiB | 4.94 GiB | 249,856 tok | 22.45× | 175 |
| fp8-head (fp8lm) | fp8_e4m3 | 21.40 GiB | 5.10 GiB | 132,048 tok | 13.42× | 193 |
| fp8-head (fp8lm) | turboquant_4bit_nc | 21.48 GiB | 5.41 GiB | 274,432 tok | 24.55× | 184 |
TurboQuant 4-bit non-causal KV cache roughly doubles the KV pool versus
fp8_e4m3 at the cost of ~4–5 % single-stream decode. The recommended
deploy mode is turboquant_4bit_nc for both branches.
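The trade-off stated above can be checked directly from the table. The sketch below uses only the token-pool and decode figures from the four rows; the helper name `tradeoff` is illustrative and not part of any tooling shipped with this repository.

```python
# (KV pool in tokens, single-stream decode tok/s), copied from the table above.
ROWS = {
    "main": {"fp8_e4m3": (119_472, 182), "turboquant_4bit_nc": (249_856, 175)},
    "fp8-head": {"fp8_e4m3": (132_048, 193), "turboquant_4bit_nc": (274_432, 184)},
}


def tradeoff(fp8, turboquant):
    """Return (KV pool growth factor, decode slowdown in percent)."""
    pool_gain = turboquant[0] / fp8[0]
    decode_loss_pct = 100.0 * (1.0 - turboquant[1] / fp8[1])
    return pool_gain, decode_loss_pct


for branch, modes in ROWS.items():
    gain, loss = tradeoff(modes["fp8_e4m3"], modes["turboquant_4bit_nc"])
    print(f"{branch}: KV pool x{gain:.2f}, decode -{loss:.1f} %")
```

Both branches land slightly above a 2× pool and between roughly 4 % and 5 % decode loss, which matches the summary sentence.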
Scaling up to 262 K context (production config)
The numbers above are the conservative Gate-4 baseline (32 K context,
max_num_seqs=16, gpu_memory_utilization=0.93). For production workloads
the same artifact serves cleanly at the full Qwen3.6-35B-A3B context window
(262 144 tokens) with gpu_memory_utilization=0.95 and max_num_seqs=64,
which gives a substantially larger KV pool because the block-allocation
strategy with longer contexts reduces partial-block waste:
| Branch | KV cache dtype | Weights | KV cache | GPU KV cache | Max concurrency @ 262 K |
|---|---|---|---|---|---|
| main (bf16head) | turboquant_4bit_nc | 21.96 GiB | 5.48 GiB | 1,022,361 tok | 3.90× |
| main (bf16head) | fp8_e4m3 | 21.88 GiB | 4.97 GiB | 492,512 tok | 1.88× |
| fp8-head (fp8lm) | turboquant_4bit_nc | 21.48 GiB | 5.48 GiB | ~1,022,000 tok | ~3.9× |
A single endpoint at max_model_len=262144 therefore serves both ordinary
short requests (where max_num_seqs=64 is the binding ceiling) and very
long-context requests (3–4× concurrent at full 262 K) from the same engine,
without needing a separate "long-context lane".
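The concurrency column of the 262 K table is consistent with dividing the KV token pool by the context length. A minimal sketch of that arithmetic follows, using only figures from the table. The function name is illustrative, this is not how vLLM itself reports the value, and the same simple ratio does not reproduce the 32 K rows, so treat it as a reading of the 262 K numbers only.

```python
def max_concurrency(kv_pool_tokens: int, max_model_len: int) -> float:
    """Number of full-length requests whose KV cache fits in the pool at once."""
    return kv_pool_tokens / max_model_len


CONTEXT_262K = 262_144

# (branch, KV cache dtype) -> GPU KV cache size in tokens, from the 262 K table.
POOLS = {
    ("main", "turboquant_4bit_nc"): 1_022_361,
    ("main", "fp8_e4m3"): 492_512,
}

for (branch, kv_dtype), pool in POOLS.items():
    ratio = max_concurrency(pool, CONTEXT_262K)
    print(f"{branch:5s} {kv_dtype:20s} {ratio:.2f}x")
```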
Branch selection
- Workflow / agent engine, tool calls, structured output: use main. BF16 lm_head keeps logit ordering intact for sampling sensitive tokens (JSON braces, enum values, IDs, dates).
- Free-form text generation, chat, summarization: use fp8-head for the extra concurrency headroom at the same context length.
Why no fp8-head-embed branch
The sibling inferRouter/Qwen3.6-27B-NVFP4 repository ships a third
fp8-head-embed branch (FP8 lm_head and FP8 embed) marked lab only
because a regression canary on a short Czech arithmetic prompt reproducibly
flipped the model's final discount comparison while the intermediate
calculation was still correct. The 35 B variant does not ship an
embed_tokens-FP8 branch — the same risk applies and the saving on this
architecture (~0.47 GiB) does not justify shipping a lab-only artifact.
Architecture / quantization summary
- Backbone: Qwen3_5MoeForConditionalGeneration
- 40 transformer layers; hybrid attention (16 self-attention layers interleaved with 24 GDN linear-attention / Mamba-style layers)
- 256 routed experts per MoE block, top-K routed
- NVFP4 (W4A4, group size 16) on every Linear not in the ignore list
- BF16 stays on: visual-encoder blocks (vision tower retained for image-text-to-text), every layer's linear_attn.*, every layer's mlp.gate and mlp.shared_expert_gate, MTP graft, and (on main) lm_head + embed_tokens.
- On fp8-head: lm_head is FP8_BLOCK [128, 128] (block-quantized float-8 with a per-128-row scale grid).
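To make the FP8_BLOCK [128, 128] layout concrete, the sketch below computes the shape of the scale grid and a rough byte count for a block-quantized weight. It assumes the usual reading of a [128, 128] block structure (one scale per 128×128 tile of the weight matrix) and assumes float32 scales. Both are assumptions, and the example shape is hypothetical rather than the real lm_head shape of this checkpoint.

```python
import math


def block_scale_grid(rows: int, cols: int, block=(128, 128)) -> tuple:
    """Shape of the scale tensor for a [rows, cols] weight, one scale per block."""
    return (math.ceil(rows / block[0]), math.ceil(cols / block[1]))


def fp8_block_bytes(rows: int, cols: int, block=(128, 128), scale_bytes: int = 4) -> int:
    """Approximate storage: 1 byte per FP8 weight plus the scales (float32 assumed)."""
    grid_rows, grid_cols = block_scale_grid(rows, cols, block)
    return rows * cols + grid_rows * grid_cols * scale_bytes


# Hypothetical projection shape, for illustration only.
rows, cols = 4096, 1024
bf16_bytes = 2 * rows * cols
fp8_bytes = fp8_block_bytes(rows, cols)
print("scale grid:", block_scale_grid(rows, cols))
print(f"BF16: {bf16_bytes} bytes, FP8_BLOCK: {fp8_bytes} bytes")
```

Under these assumptions the scale overhead is tiny next to the weights, so the block-quantized projection takes close to half the BF16 footprint.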
A frozen Multi-Token-Prediction head (model_mtp.safetensors, ~1.6 GB) is
included for compatibility with vLLM speculative-decode setups; it is not
loaded by default.
Recipe and calibration
This build follows the spirit of Red Hat / Neural Magic's LLM Compressor recipes for NVFP4 MoE checkpoints, with an adjusted calibration mix tuned for Czech-language robustness and Czech legal-domain coverage in addition to the usual English chat / math / code / multilingual diet. The raw calibration corpus is not redistributed with this model card; the public artifact records the important reproducibility metadata in the checkpoint config and branch layout.
Calibration ran one full pass at 1280 samples and 8192 max sequence length
on the BF16 source checkpoint, producing the uniform NVFP4 inner used by
both branches. The fp8-head branch keeps the calibrated FP8_BLOCK
lm_head; the main branch surgically restores the original BF16
lm_head from the base checkpoint. This avoids a second 28-hour calibration
run while preserving the same NVFP4 inner numerics.
Files
Per branch:
chat_template.jinja
config.json
generation_config.json
model.safetensors (~22–23 GB; differs across branches by ~0.5 GB)
model.safetensors.index.json
model_mtp.safetensors (~1.6 GB; identical across branches)
processor_config.json
recipe.yaml
tokenizer.json
tokenizer_config.json
The main branch additionally hosts a vllm_patches/
folder with the source overlay needed to dispatch FP8 weights on lm_head
through the compressed-tensors path when serving the fp8-head branch on
vLLM versions that have not yet merged the upstream PR.
Recommended vLLM serve config
Single RTX 5090 32 GB, full 262 K context, production-tuned,
turboquant_4bit_nc KV cache, max 64 in-flight sequences:
vllm serve inferRouter/Qwen3.6-35B-A3B-NVFP4 \
--revision main \
--served-model-name qwen35b-a3b \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--max-model-len 262144 \
--max-num-seqs 64 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype turboquant_4bit_nc \
--enable-chunked-prefill \
--enable-prefix-caching \
--dtype bfloat16 \
--trust-remote-code
For the fp8-head branch, change --revision to fp8-head and keep
everything else the same.
If you do not need the full 262 K window and want slightly more VRAM
headroom (about 0.5 GiB extra), drop --max-model-len to 32768 and
--gpu-memory-utilization to 0.93. That is the Gate-4 baseline whose
numbers appear in the first VRAM table above.
max_num_batched_tokens must be at least 4096 because the GDN linear-attn
layers in this architecture impose a Mamba block-size constraint of 3072
tokens. The recommended value of 8192 keeps long-prompt TTFT reasonable
without exhausting per-iteration GPU memory.
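If the serve command is generated from a script, the constraints above can be enforced before launch. This is a sketch rather than shipped tooling: `build_serve_args` is a hypothetical helper, and every flag and value it emits is taken from the serve command and notes in this section.

```python
import shlex

MODEL_ID = "inferRouter/Qwen3.6-35B-A3B-NVFP4"
MIN_BATCHED_TOKENS = 4096  # lower bound stated in this card for the GDN layers


def build_serve_args(
    revision: str = "main",
    kv_cache_dtype: str = "turboquant_4bit_nc",
    max_model_len: int = 262144,
    gpu_memory_utilization: float = 0.95,
    max_num_seqs: int = 64,
    max_num_batched_tokens: int = 8192,
) -> list:
    """Return the argv list for `vllm serve`, rejecting values this card rules out."""
    if revision not in ("main", "fp8-head"):
        raise ValueError(f"unknown branch: {revision!r}")
    if max_num_batched_tokens < MIN_BATCHED_TOKENS:
        raise ValueError(
            f"max_num_batched_tokens must be >= {MIN_BATCHED_TOKENS}, "
            f"got {max_num_batched_tokens}"
        )
    return [
        "vllm", "serve", MODEL_ID,
        "--revision", revision,
        "--served-model-name", "qwen35b-a3b",
        "--tensor-parallel-size", "1",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--kv-cache-dtype", kv_cache_dtype,
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
        "--dtype", "bfloat16",
        "--trust-remote-code",
    ]


# Gate-4 baseline on the fp8-head branch, printed as a copy-pasteable command.
print(shlex.join(build_serve_args(
    revision="fp8-head",
    max_model_len=32768,
    gpu_memory_utilization=0.93,
)))
```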
If your vLLM image does not include the upstream TurboQuant hybrid
attention support yet, swap --kv-cache-dtype turboquant_4bit_nc for
--kv-cache-dtype fp8_e4m3 and you will get the fp8_e4m3 baseline
numbers from the tables above — both KV modes are tested and stable on
both branches. Current vLLM main / nightly builds after 2026-05-05 should
already include the TurboQuant hybrid patch; manual PR #39931 patching is
mainly for pinned releases and older vendor images.
Measured throughput under sustained load
Numbers below come from a 13-minute sustained run on a single RTX 5090
32 GB with the production config above (262 K context, max_num_seqs=64,
turboquant_4bit_nc, chunked prefill, ignore_eos=true so every request
runs to max_tokens):
Production-shape workload (25 concurrent, 30 K input + 2 000 forced output, 5 minutes sustained):
| Metric | Value |
|---|---|
| Requests completed / errors | 100 / 0 |
| Wall p50 / p95 | 92.3 s / 108.9 s |
| TTFT p50 / p95 | 2.04 s / 19.8 s |
| Aggregate decode | 542 tok/s |
| Aggregate total (prefill + decode) | 7,005 tok/s |
| Per-request decode | 22.2 tok/s |
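The per-request decode figure is consistent with dividing the forced output length by the p50 decode window (wall time minus TTFT). The sketch below is one reading of the table values, not a description of how the benchmark harness computed them; the function name is illustrative.

```python
def per_request_decode_rate(output_tokens: int, wall_s: float, ttft_s: float) -> float:
    """Tokens per second for one request, counting only the time after the first token."""
    return output_tokens / (wall_s - ttft_s)


# 2 000 forced output tokens, wall p50 92.3 s, TTFT p50 2.04 s (from the table above).
rate = per_request_decode_rate(2000, 92.3, 2.04)
print(f"{rate:.1f} tok/s per request")
```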
Saturation ramp (15 K input + 1 500 forced output, 60 s per step):
| Concurrent | Wall p50 | TTFT p50 | Aggregate decode |
|---|---|---|---|
| 30 | 50.7 s | 1.81 s | 865 tok/s (peak) |
| 40 | 63.9 s | 8.6 s | 810 tok/s |
| 50 | 71.7 s | 10.8 s | 793 tok/s |
| 60 | 75.8 s | 12.9 s | 839 tok/s |
| 64 | 77.1 s | 13.7 s | 854 tok/s |
The sweet spot is around 30 concurrent requests: aggregate decode peaks at ~865 tok/s with a TTFT p50 of 1.8 s. Above 30 concurrent, throughput plateaus at roughly 790–855 tok/s because the engine is decode-bound at that point, and TTFT p50 grows from 1.8 s to 13.7 s as the scheduler queue backs up. The container served 471 requests across the full 13-minute stress run without errors or degraded responses.
Very long single-request contexts (60 K, 120 K, 200 K tokens) do load and run in the same engine, but compete with the normal cohort for KV slots and prefill bandwidth. Under sustained 30-concurrent normal load, TTFT for a 60 K outlier was 73 s and a 200 K outlier did not complete in a 3-minute window. For batch workloads where outliers are rare (≤ 0.01 % of requests in our tests), either schedule them serially after the main cohort or route them through a dedicated low-priority worker; do not expect interactive latency on outliers while the engine is saturated by the normal cohort.
Runtime requirements per branch
Base requirement for both branches: a recent vLLM build with compressed-tensors NVFP4 support for Blackwell GPUs and the Qwen3.6 hybrid architecture. The table below lists the extra patch requirements on top of that base runtime.
Three independent vLLM patches may be needed depending on which branch you serve and which KV cache dtype you choose:
| Patch | Source | When required |
|---|---|---|
| TurboQuant hybrid attention | vLLM PR #39931 | Any branch, if --kv-cache-dtype turboquant_* is set. Qwen3.6-35B-A3B is a hybrid architecture (self-attn + GDN linear-attn); the in-tree TurboQuant rejects hybrid models before this PR. The PR was merged to vllm-project/vllm:main on 2026-05-05, so current vLLM main/nightly builds after that date should already include it. Stock releases cut before that date, including v0.20.x, still need either the PR applied or a newer nightly/base image. |
| TurboQuant continuation-prefill workspace fix | vLLM PR #40798 | Any branch, if --kv-cache-dtype turboquant_* is set and chunked prefill is enabled (default in vLLM V1). Without this patch, prompts longer than max_num_batched_tokens trigger a _continuation_prefill workspace size assertion at runtime (see upstream issues #41726, #41565, #40420). The patch reserves the maximum-shape continuation_prefill workspace before locking it during CUDA-graph capture. Still open upstream at the time of this write-up; until it is merged, apply it as a source overlay on the vLLM Python package in the serving image. The fp8_e4m3 KV cache mode uses a different attention backend and does not require this patch. |
| Compressed-tensors FP8 head dispatch | vllm_patches/ | fp8-head branch only. The in-tree compressed-tensors dispatcher in vLLM 0.20.x routes only LinearBase, ParallelLMHead, Attention, and FusedMoE modules through quant schemes; FP8 weight loading on lm_head needs the additional dispatch patch shipped in this repo. An upstream PR for this is in progress. |
Combined matrix:
| Branch | KV cache dtype | Needs PR #39931? | Needs PR #40798? | Needs vllm_patches/? |
|---|---|---|---|---|
| main | fp8_e4m3 (or default) | no | no | no |
| main | turboquant_4bit_nc (recommended) | yes | yes | no |
| fp8-head | fp8_e4m3 | no | no | yes |
| fp8-head | turboquant_4bit_nc (recommended) | yes | yes | yes |
If your vLLM build or base image already includes PR #39931 — for example
a vLLM main/nightly build from after 2026-05-05 — you only need PR #40798
(for turboquant_* KV modes) and, on the fp8-head branch, the
vllm_patches/ overlay. The main branch with fp8_e4m3 KV cache runs
out of the box on any recent vLLM build.
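The matrix and the note above reduce to a small lookup. The helper below is a hypothetical sketch of that logic for deployment scripts. It assumes chunked prefill is enabled (the vLLM V1 default mentioned above), and the `has_pr_39931` flag stands for a build that already contains the merged TurboQuant hybrid patch.

```python
def required_patches(branch: str, kv_cache_dtype: str, has_pr_39931: bool = False) -> list:
    """List the extra patches needed for a branch / KV-cache-dtype combination."""
    if branch not in ("main", "fp8-head"):
        raise ValueError(f"unknown branch: {branch!r}")
    patches = []
    if kv_cache_dtype.startswith("turboquant_"):
        if not has_pr_39931:
            patches.append("PR #39931")  # TurboQuant hybrid attention
        patches.append("PR #40798")      # continuation-prefill workspace fix
    if branch == "fp8-head":
        patches.append("vllm_patches/")  # compressed-tensors FP8 head dispatch
    return patches


for branch in ("main", "fp8-head"):
    for kv_dtype in ("fp8_e4m3", "turboquant_4bit_nc"):
        print(f"{branch:9s} {kv_dtype:20s} {required_patches(branch, kv_dtype)}")
```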
What vllm_patches/ adds (fp8-head only)
The vllm_patches/ folder on the main branch contains:
- apply_ct_fp8_lmhead_patch.py: idempotent source patcher. Adds the lm_head branch to CompressedTensorsConfig.get_quant_method, wires the FP8 weight + block-scale parameters into ParallelLMHead, and adjusts vocab_parallel_embedding.py so the FP8 scale companion loads cleanly. The patcher detects when upstream PR #41000-style wire-up is already present and skips the relevant patch points.
- compressed_tensors_embedding.py: companion runtime module for the compressed-tensors FP8 embedding experiment. It is included because the patch bundle is shared with sibling checkpoints, but this 35 B repository does not ship an FP8-embedding branch.
- Dockerfile.turboquant: minimal Docker overlay that applies the patcher on top of a vLLM image that already includes PR #39931 plus the upstream FP8 lm_head work (Red Hat / Neural Magic vLLM nightly). Builds in ~30 seconds.
Apply the patch (local dev)
# Clone a vLLM source tree. Stock v0.20.2 is fine for fp8_e4m3 KV-cache
# testing, but does NOT include TurboQuant hybrid attention support. For
# --kv-cache-dtype turboquant_* use vLLM main/nightly from 2026-05-05 or
# later, or a base image with PR #39931.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm.git vllm-src
cd vllm-src
# Place the patcher and companion in /tmp and run the patcher against
# your vLLM install path.
cp /path/to/vllm_patches/apply_ct_fp8_lmhead_patch.py /tmp/
cp /path/to/vllm_patches/compressed_tensors_embedding.py /tmp/
python3 /tmp/apply_ct_fp8_lmhead_patch.py /path/to/site-packages
python3 -m py_compile \
/path/to/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py \
/path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py \
/path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_embedding.py
The patcher is idempotent and detects when PR #41000 has already wired
quant_config into ParallelLMHead; in that case it skips the first
patch point. For the compressed-tensors FP8 head dispatch itself, it
works on both stock v0.20.2 and on
Red Hat / Neural Magic vLLM nightlies
that already include the upstream lm_head FP8 work. TurboQuant hybrid
support is separate and still requires PR #39931 or a newer vLLM build
that includes it.
Docker overlay (recommended)
# inside vllm_patches/
docker build -f Dockerfile.turboquant -t local/vllm-qwen36-35b-a3b:patched .
The base image referenced in Dockerfile.turboquant already includes the
TurboQuant hybrid patch (PR #39931) and the upstream FP8 lm_head work;
the overlay only adds the compressed-tensors dispatch.
Inference example
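from openai import OpenAI

# Points at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-anything")

resp = client.chat.completions.create(
    model="qwen35b-a3b",  # matches --served-model-name in the serve config
    messages=[
        # Czech prompt: "Write me three sentences about the importance of bees for agriculture."
        {"role": "user", "content": "Napis mi tri vety o vyznamu vcel pro zemedelstvi."},
    ],
    max_tokens=200,
    temperature=0.7,
    # Disable the reasoning trace so the short max_tokens budget goes to the answer.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)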
The Qwen3.6 chat template supports both enable_thinking=true and
enable_thinking=false. For tests and short prompts prefer false —
thinking mode emits a reasoning trace before the final answer and a low
max_tokens cap will cut the trace mid-flight, returning content=null.
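Client code can guard against the truncated-trace case described above. The sketch below assumes the standard OpenAI chat-completions response shape, where a truncated generation reports finish_reason "length". The helper name `final_answer` is hypothetical, and the stand-in response object only exists so the example runs without a server.

```python
from types import SimpleNamespace


def final_answer(resp) -> str:
    """Return the message content, failing loudly when no final answer was produced."""
    choice = resp.choices[0]
    content = choice.message.content
    if content is None:
        if choice.finish_reason == "length":
            raise RuntimeError(
                "generation hit max_tokens before the final answer; "
                "raise max_tokens or set enable_thinking to False"
            )
        raise RuntimeError(f"no content returned (finish_reason={choice.finish_reason!r})")
    return content


# Stand-in for a response whose reasoning trace was cut off by max_tokens.
truncated = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length",
                             message=SimpleNamespace(content=None))]
)

try:
    final_answer(truncated)
except RuntimeError as err:
    print("handled:", err)
```

With a real client, pass the object returned by client.chat.completions.create(...) instead of the stand-in.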
Quality notes
The 35 B fp8-head branch passed the same short smoke set used for the
production branch, including the Czech discount canary that caught the
FP8-embedding regression on the 27 B sibling model. That result should not
be read as proof that FP8 output projection is universally harmless; it is
why the recommended workflow-engine branch remains main with BF16
lm_head and BF16 embed_tokens.
For structured output, tool routing, JSON, enum selection, or long-running
agent workflows, prefer main. For summarization, chat, and prose-heavy
generation where a small output-projection quantization risk is acceptable,
fp8-head gives about 9 % more full-32K concurrency with TurboQuant.
Credits
- Qwen team for the base Qwen3.6-35B-A3B checkpoint and the MoE architecture.
- Red Hat / Neural Magic LLM Compressor team for the NVFP4 calibration recipe this build is derived from.
- vLLM project for the serving runtime, the TurboQuant hybrid attention work (PR #39931), and the upstream FP8 lm_head work.
License
Apache-2.0, inherited from the base Qwen/Qwen3.6-35B-A3B model.