Instructions for using inferRouter/Qwen3.6-35B-A3B-NVFP4 with libraries and local apps.

Libraries

How to use inferRouter/Qwen3.6-35B-A3B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="inferRouter/Qwen3.6-35B-A3B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("inferRouter/Qwen3.6-35B-A3B-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("inferRouter/Qwen3.6-35B-A3B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inferRouter/Qwen3.6-35B-A3B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inferRouter/Qwen3.6-35B-A3B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-35B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/inferRouter/Qwen3.6-35B-A3B-NVFP4

SGLang

How to use inferRouter/Qwen3.6-35B-A3B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inferRouter/Qwen3.6-35B-A3B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-35B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inferRouter/Qwen3.6-35B-A3B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use inferRouter/Qwen3.6-35B-A3B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/inferRouter/Qwen3.6-35B-A3B-NVFP4
```

Qwen3.6-35B-A3B-NVFP4

Mixed-precision NVFP4 (+ optional FP8 lm_head) quantization of Qwen/Qwen3.6-35B-A3B — the Mixture-of-Experts variant with 35 B total parameters and ~3 B active per token — targeting native Blackwell (SM120) deployment, primarily RTX 5090 32 GB.

The original BF16 checkpoint needs ~67 GiB of VRAM. This build fits a single RTX 5090 32 GB at 32 K context with usable KV cache, multi-turn reasoning, hybrid attention (self-attn + GDN linear-attn), and tool-call support.

TL;DR

Use main + --kv-cache-dtype turboquant_4bit_nc for production workflow / tool-call / structured-output serving.
Use fp8-head + --kv-cache-dtype turboquant_4bit_nc when you want the highest concurrency and the workload is mostly free-form text.
Do not expect an fp8-head-embed branch here. FP8 embeddings were intentionally skipped after a regression canary on the 27 B sibling model.
The reported concurrency is measured on a single RTX 5090 32 GB with max_model_len=32768, max_num_seqs=16, and CUDA graphs enabled.

Variants

The repository hosts two branches:

Branch	`lm_head` dtype	`embed_tokens` dtype	Intended deployment
`main`	BF16	BF16	Production default. Workflow / agent engine — preserves output-token precision for JSON, tool calls, enums.
`fp8-head`	FP8_BLOCK [128, 128]	BF16	Free-form text and chat at higher concurrency. ~10 % more KV-cache headroom than `main` at the cost of a small FP8 quantization of the output projection.

Inner Linear layers (every MoE expert projection plus the self-attention projections) use the same uniform NVFP4 calibration on both branches, so the calibrated NVFP4 inner weights are bit-identical across the two; only the output projection differs.

VRAM and concurrency (RTX 5090 32 GB)

Measured on a single RTX 5090 with gpu_memory_utilization=0.93, max_model_len=32768, max_num_seqs=16, max_num_batched_tokens=4096, dtype=bfloat16.

Branch	KV cache dtype	Weights	KV cache	GPU KV cache	Max concurrency @ 32 K	Decode tok/s
`main` (bf16head)	`fp8_e4m3`	21.88 GiB	4.63 GiB	119,472 tok	12.16×	182
`main` (bf16head)	`turboquant_4bit_nc`	21.96 GiB	4.94 GiB	249,856 tok	22.45×	175
`fp8-head` (fp8lm)	`fp8_e4m3`	21.40 GiB	5.10 GiB	132,048 tok	13.42×	193
`fp8-head` (fp8lm)	`turboquant_4bit_nc`	21.48 GiB	5.41 GiB	274,432 tok	24.55×	184

TurboQuant 4-bit non-causal KV cache roughly doubles the KV pool versus fp8_e4m3 at the cost of ~4–5 % single-stream decode. The recommended deploy mode is turboquant_4bit_nc for both branches.
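The trade-off quoted above can be re-derived from the table rows. The sketch below is plain arithmetic over the published figures; the dictionary layout and the function name are illustrative assumptions, not part of any tooling in this repository.

```python
# Rows copied from the 32 K table: (branch, KV dtype) -> (KV pool in tokens, decode tok/s).
ROWS = {
    ("main", "fp8_e4m3"): (119_472, 182),
    ("main", "turboquant_4bit_nc"): (249_856, 175),
    ("fp8-head", "fp8_e4m3"): (132_048, 193),
    ("fp8-head", "turboquant_4bit_nc"): (274_432, 184),
}


def turboquant_tradeoff(branch):
    """Return (KV pool growth factor, fractional decode slowdown) for one branch."""
    fp8_tokens, fp8_speed = ROWS[(branch, "fp8_e4m3")]
    tq_tokens, tq_speed = ROWS[(branch, "turboquant_4bit_nc")]
    return tq_tokens / fp8_tokens, 1 - tq_speed / fp8_speed


for branch in ("main", "fp8-head"):
    growth, slowdown = turboquant_tradeoff(branch)
    print(f"{branch}: KV pool x{growth:.2f}, single-stream decode -{slowdown:.1%}")
```

Both branches land at a little over 2× the KV pool for a decode cost between 3 % and 5 %.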

Scaling up to 262 K context (production config)

The numbers above are the conservative Gate-4 baseline (32 K context, max_num_seqs=16, gpu_memory_utilization=0.93). For production workloads the same artifact serves cleanly at the full Qwen3.6-35B-A3B context window (262 144 tokens) with gpu_memory_utilization=0.95 and max_num_seqs=64, which gives a substantially larger KV pool because the block-allocation strategy with longer contexts reduces partial-block waste:

Branch	KV cache dtype	Weights	KV cache	GPU KV cache	Max concurrency @ 262 K
`main` (bf16head)	`turboquant_4bit_nc`	21.96 GiB	5.48 GiB	1,022,361 tok	3.90×
`main` (bf16head)	`fp8_e4m3`	21.88 GiB	4.97 GiB	492,512 tok	1.88×
`fp8-head` (fp8lm)	`turboquant_4bit_nc`	21.48 GiB	5.48 GiB	~1,022,000 tok	~3.9×

A single endpoint at max_model_len=262144 therefore serves both ordinary short requests (where max_num_seqs=64 is the binding ceiling) and very long-context requests (about 3.9× concurrent at full 262 K with turboquant_4bit_nc, about 1.9× with fp8_e4m3) from the same engine, without needing a separate "long-context lane".
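The concurrency column of the 262 K table is consistent with dividing the KV pool size by the context length. The sketch below is offered only as a check on that table, not as the engine's exact accounting; the function name is an assumption.

```python
MAX_MODEL_LEN = 262_144  # full context window from the production config


def full_context_concurrency(kv_pool_tokens, max_model_len=MAX_MODEL_LEN):
    """How many requests of max_model_len tokens fit in the KV pool at once."""
    return kv_pool_tokens / max_model_len


print(round(full_context_concurrency(1_022_361), 2))  # main + turboquant_4bit_nc -> 3.9
print(round(full_context_concurrency(492_512), 2))    # main + fp8_e4m3 -> 1.88
```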

Branch selection

Workflow / agent engine, tool calls, structured output: use main. BF16 lm_head keeps logit ordering intact for sampling-sensitive tokens (JSON braces, enum values, IDs, dates).
Free-form text generation, chat, summarization: use fp8-head for the extra concurrency headroom at the same context length.

Why no `fp8-head-embed` branch

The sibling inferRouter/Qwen3.6-27B-NVFP4 repository ships a third branch, fp8-head-embed (FP8 lm_head and FP8 embed_tokens), marked lab-only: a regression canary on a short Czech arithmetic prompt reproducibly flipped the model's final discount comparison even though the intermediate calculation was still correct. The 35 B variant does not ship an FP8 embed_tokens branch. The same risk applies, and the saving on this architecture (~0.47 GiB) does not justify shipping a lab-only artifact.

Architecture / quantization summary

Backbone: Qwen3_5MoeForConditionalGeneration
40 transformer layers; hybrid attention (16 self-attention layers interleaved with 24 GDN linear-attention / Mamba-style layers)
256 routed experts per MoE block, top-K routed
NVFP4 (W4A4, group size 16) on every Linear not in the ignore list
BF16 stays on: visual-encoder blocks (vision tower retained for image-text-to-text), every layer's linear_attn.*, every layer's mlp.gate and mlp.shared_expert_gate, MTP graft, and (on main) lm_head + embed_tokens.
On fp8-head: lm_head is FP8_BLOCK [128, 128] (block-quantized float-8 with one scale per 128 × 128 weight block).
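To make "group size 16" concrete, here is a minimal fake-quantization sketch of the weight side of NVFP4: each group of 16 values shares one scale and every value snaps to the nearest 4-bit E2M1 magnitude. It is a simplification under stated assumptions: the group scale is kept as a Python float (real NVFP4 stores it in FP8 E4M3 alongside a per-tensor scale), activation quantization is not shown, and the function names are hypothetical rather than LLM Compressor APIs.

```python
import math

# Magnitudes representable by the 4-bit E2M1 float format that NVFP4 stores.
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # group size quoted in the summary above


def fake_quantize_group(values):
    """Quantize one group to E2M1 with a shared scale, then dequantize it."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values), 1.0
    scale = max_abs / E2M1_LEVELS[-1]  # the largest magnitude maps to 6.0
    dequantized = []
    for v in values:
        level = min(E2M1_LEVELS, key=lambda q: abs(abs(v) / scale - q))
        dequantized.append(math.copysign(level * scale, v))
    return dequantized, scale


def fake_quantize_row(row):
    """Apply the per-group quantizer across a row whose length is a multiple of 16."""
    if len(row) % GROUP_SIZE:
        raise ValueError("row length must be a multiple of GROUP_SIZE")
    out = []
    for start in range(0, len(row), GROUP_SIZE):
        group, _scale = fake_quantize_group(row[start:start + GROUP_SIZE])
        out.extend(group)
    return out


# Two groups with different dynamic ranges, each scaled independently.
row = [math.sin(i) * (1 + i // GROUP_SIZE) for i in range(32)]
approx = fake_quantize_row(row)
print("max abs error:", max(abs(a - b) for a, b in zip(row, approx)))
```

Because each group carries its own scale, a group with small weights is not forced onto the coarse grid implied by a larger group elsewhere in the row.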

A frozen Multi-Token-Prediction head (model_mtp.safetensors, ~1.6 GB) is included for compatibility with vLLM speculative-decode setups; it is not loaded by default.

Recipe and calibration

This build follows the spirit of Red Hat / Neural Magic's LLM Compressor recipes for NVFP4 MoE checkpoints, with an adjusted calibration mix tuned for Czech-language robustness and Czech legal-domain coverage in addition to the usual English chat / math / code / multilingual diet. The raw calibration corpus is not redistributed with this model card; the public artifact records the important reproducibility metadata in the checkpoint config and branch layout.

Calibration ran one full pass at 1280 samples and 8192 max sequence length on the BF16 source checkpoint, producing the uniform NVFP4 inner used by both branches. The fp8-head branch keeps the calibrated FP8_BLOCK lm_head; the main branch surgically restores the original BF16 lm_head from the base checkpoint. This avoids a second 28-hour calibration run while preserving the same NVFP4 inner numerics.

Files

Per branch:

chat_template.jinja
config.json
generation_config.json
model.safetensors                (~22–23 GB; differs across branches by ~0.5 GB)
model.safetensors.index.json
model_mtp.safetensors            (~1.6 GB; identical across branches)
processor_config.json
recipe.yaml
tokenizer.json
tokenizer_config.json

The main branch additionally hosts a vllm_patches/ folder with the source overlay needed to dispatch FP8 weights on lm_head through the compressed-tensors path when serving the fp8-head branch on vLLM versions that have not yet merged the upstream PR.

Recommended vLLM serve config

Single RTX 5090 32 GB, full 262 K context, production-tuned, turboquant_4bit_nc KV cache, max 64 in-flight sequences:

vllm serve inferRouter/Qwen3.6-35B-A3B-NVFP4 \
  --revision main \
  --served-model-name qwen35b-a3b \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype turboquant_4bit_nc \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype bfloat16 \
  --trust-remote-code

For the fp8-head branch, change --revision to fp8-head and keep everything else the same.

If you do not need the full 262 K window and want a slightly safer VRAM headroom (extra ~0.5 GiB), drop --max-model-len to 32768 and --gpu-memory-utilization to 0.93 — that is the Gate-4 baseline whose numbers are in the upper VRAM table.

max_num_batched_tokens must be at least 4096 because the GDN linear-attn layers in this architecture impose a Mamba block-size constraint of 3072 tokens. The recommended value of 8192 keeps long-prompt TTFT reasonable without exhausting per-iteration GPU memory.

If your vLLM image does not include the upstream TurboQuant hybrid attention support yet, swap --kv-cache-dtype turboquant_4bit_nc for --kv-cache-dtype fp8_e4m3 and you will get the fp8_e4m3 baseline numbers from the tables above — both KV modes are tested and stable on both branches. Current vLLM main / nightly builds after 2026-05-05 should already include the TurboQuant hybrid patch; manual PR #39931 patching is mainly for pinned releases and older vendor images.
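The variations described above (branch, KV cache dtype, context length) can be captured in one small helper that assembles the command line. This is a sketch: the flags and values are the ones listed in this section, while the function name and its defaults are assumptions made for illustration.

```python
import shlex

MIN_BATCHED_TOKENS = 4096  # lower bound stated above for the GDN linear-attn layers


def build_serve_command(revision="main", kv_cache_dtype="turboquant_4bit_nc",
                        max_model_len=262_144, gpu_memory_utilization=0.95,
                        max_num_seqs=64, max_num_batched_tokens=8192):
    """Return the recommended `vllm serve` invocation as an argument list."""
    if revision not in ("main", "fp8-head"):
        raise ValueError(f"unknown branch: {revision}")
    if max_num_batched_tokens < MIN_BATCHED_TOKENS:
        raise ValueError("max_num_batched_tokens must be at least 4096")
    return [
        "vllm", "serve", "inferRouter/Qwen3.6-35B-A3B-NVFP4",
        "--revision", revision,
        "--served-model-name", "qwen35b-a3b",
        "--tensor-parallel-size", "1",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--kv-cache-dtype", kv_cache_dtype,
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
        "--dtype", "bfloat16",
        "--trust-remote-code",
    ]


# fp8-head branch with the fp8_e4m3 fallback for images without TurboQuant hybrid support.
print(shlex.join(build_serve_command("fp8-head", kv_cache_dtype="fp8_e4m3")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues.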

Measured throughput under sustained load

Numbers below come from a 13-minute sustained run on a single RTX 5090 32 GB with the production config above (262 K context, max_num_seqs=64, turboquant_4bit_nc, chunked prefill, ignore_eos=true so every request runs to max_tokens):

Production-shape workload (25 concurrent, 30 K input + 2 000 forced output, 5 minutes sustained):

Metric	Value
Requests completed / errors	100 / 0
Wall p50 / p95	92.3 s / 108.9 s
TTFT p50 / p95	2.04 s / 19.8 s
Aggregate decode	542 tok/s
Aggregate total (prefill + decode)	7,005 tok/s
Per-request decode	22.2 tok/s

Saturation ramp (15 K input + 1 500 forced output, 60 s per step):

Concurrent	Wall p50	TTFT p50	Aggregate decode
30	50.7 s	1.81 s	865 tok/s (peak)
40	63.9 s	8.6 s	810 tok/s
50	71.7 s	10.8 s	793 tok/s
60	75.8 s	12.9 s	839 tok/s
64	77.1 s	13.7 s	854 tok/s

The sweet spot is around 30 concurrent requests: aggregate decode peaks at ~865 tok/s with a TTFT p50 of 1.8 s. Above 30 concurrent, throughput plateaus at roughly 790–855 tok/s because the engine is decode-bound at that point, and TTFT p50 grows from 1.8 s to 13.7 s as the scheduler queue backs up. The container served 471 requests across the full 13-minute stress run without errors or degraded responses.
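Since aggregate decode throughput plateaus, the speed each request sees falls roughly in proportion to concurrency. The sketch below divides the table's aggregate figures by the number of in-flight requests; it assumes throughput is shared evenly, which the benchmark does not itself report.

```python
# (concurrent requests, aggregate decode tok/s) from the saturation ramp table.
RAMP = [(30, 865), (40, 810), (50, 793), (60, 839), (64, 854)]


def per_request_decode(concurrency, aggregate_tok_s):
    """Average decode rate each in-flight request sees under even sharing."""
    return aggregate_tok_s / concurrency


for concurrency, aggregate in RAMP:
    rate = per_request_decode(concurrency, aggregate)
    print(f"{concurrency:>2} concurrent: {rate:.1f} tok/s per request")
```

At 30 concurrent this works out to just under 29 tok/s per request, dropping to roughly 13 tok/s at 64.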

Very long single-request contexts (60 K, 120 K, 200 K tokens) do load and run in the same engine, but compete with the normal cohort for KV slots and prefill bandwidth. Under sustained 30-concurrent normal load, TTFT for a 60 K outlier was 73 s and a 200 K outlier did not complete in a 3-minute window. For batch workloads where outliers are rare (≤ 0.01 % of requests in our tests), either schedule them serially after the main cohort or route them through a dedicated low-priority worker; do not expect interactive latency on outliers while the engine is saturated by the normal cohort.
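One way to implement "schedule them serially after the main cohort" is to split a batch by prompt length before submitting it. The sketch below is illustrative only: the 60 000-token threshold and the `(request_id, prompt_tokens)` tuple shape are hypothetical choices, not settings from the benchmark.

```python
OUTLIER_THRESHOLD_TOKENS = 60_000  # hypothetical cut-off, not a value from the benchmark config


def order_batch(requests, threshold=OUTLIER_THRESHOLD_TOKENS):
    """Split (request_id, prompt_tokens) pairs into the normal cohort and long outliers.

    The normal cohort keeps its original order; outliers are sorted
    shortest-first so they can be run one at a time afterwards.
    """
    normal = [r for r in requests if r[1] < threshold]
    outliers = sorted((r for r in requests if r[1] >= threshold), key=lambda r: r[1])
    return normal, outliers


batch = [("a", 15_000), ("b", 200_000), ("c", 30_000), ("d", 60_000)]
normal, outliers = order_batch(batch)
print(normal)    # [('a', 15000), ('c', 30000)]
print(outliers)  # [('d', 60000), ('b', 200000)]
```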

Runtime requirements per branch

Base requirement for both branches: a recent vLLM build with compressed-tensors NVFP4 support for Blackwell GPUs and the Qwen3.6 hybrid architecture. The table below lists the extra patch requirements on top of that base runtime.

Three independent vLLM patches may be needed depending on which branch you serve and which KV cache dtype you choose:

Patch	Source	When required
TurboQuant hybrid attention	vLLM PR #39931	Any branch, if `--kv-cache-dtype turboquant_*` is set. Qwen3.6-35B-A3B is a hybrid architecture (self-attn + GDN linear-attn); the in-tree TurboQuant rejects hybrid models before this PR. The PR was merged to `vllm-project/vllm:main` on 2026-05-05, so current vLLM main/nightly builds after that date should already include it. Stock releases cut before that date, including v0.20.x, still need either the PR applied or a newer nightly/base image.
TurboQuant continuation-prefill workspace fix	vLLM PR #40798	Any branch, if `--kv-cache-dtype turboquant_*` is set and chunked prefill is enabled (default in vLLM V1). Without this patch, prompts longer than `max_num_batched_tokens` trigger a `_continuation_prefill` workspace size assertion at runtime (see upstream issues #41726, #41565, #40420). The patch reserves the maximum-shape `continuation_prefill` workspace before locking it during CUDA-graph capture. Still open upstream at the time of this write-up; until it is merged, apply it as a source overlay on the vLLM Python package in the serving image. The `fp8_e4m3` KV cache mode uses a different attention backend and does not require this patch.
Compressed-tensors FP8 head dispatch	`vllm_patches/`	`fp8-head` branch only. The in-tree `compressed-tensors` dispatcher in vLLM 0.20.x routes only `LinearBase`, `ParallelLMHead`, `Attention`, and `FusedMoE` modules through quant schemes; FP8 weight loading on `lm_head` needs the additional dispatch patch shipped in this repo. An upstream PR for this is in progress.

Combined matrix:

Branch	KV cache dtype	Needs PR #39931?	Needs PR #40798?	Needs `vllm_patches/`?
`main`	`fp8_e4m3` (or default)	no	no	no
`main`	`turboquant_4bit_nc` (recommended)	yes	yes	no
`fp8-head`	`fp8_e4m3`	no	no	yes
`fp8-head`	`turboquant_4bit_nc` (recommended)	yes	yes	yes

If your vLLM build or base image already includes PR #39931 — for example a vLLM main/nightly build from after 2026-05-05 — you only need PR #40798 (for turboquant_* KV modes) and, on the fp8-head branch, the vllm_patches/ overlay. The main branch with fp8_e4m3 KV cache runs out of the box on any recent vLLM build.
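The combined matrix can be expressed as a small lookup function, which is handy for image-build scripts that need to decide which overlays to apply. This is a sketch of the rules stated above; the function and parameter names are assumptions.

```python
def required_patches(branch, kv_cache_dtype="fp8_e4m3",
                     image_has_pr_39931=False, chunked_prefill=True):
    """Return the extra patches needed on top of the base runtime, per the matrix above."""
    if branch not in ("main", "fp8-head"):
        raise ValueError(f"unknown branch: {branch}")
    patches = []
    if kv_cache_dtype.startswith("turboquant_"):
        if not image_has_pr_39931:
            patches.append("PR #39931")  # TurboQuant hybrid attention
        if chunked_prefill:
            patches.append("PR #40798")  # continuation-prefill workspace fix
    if branch == "fp8-head":
        patches.append("vllm_patches/")  # compressed-tensors FP8 head dispatch
    return patches


print(required_patches("main"))                            # []
print(required_patches("fp8-head", "turboquant_4bit_nc"))  # ['PR #39931', 'PR #40798', 'vllm_patches/']
print(required_patches("main", "turboquant_4bit_nc", image_has_pr_39931=True))  # ['PR #40798']
```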

What `vllm_patches/` adds (fp8-head only)

The vllm_patches/ folder on the main branch contains:

apply_ct_fp8_lmhead_patch.py — idempotent source patcher. Adds the lm_head branch to CompressedTensorsConfig.get_quant_method, wires the FP8 weight + block-scale parameters into ParallelLMHead, and adjusts vocab_parallel_embedding.py so the FP8 scale companion loads cleanly. The patcher detects when upstream PR #41000-style wire-up is already present and skips the relevant patch points.
compressed_tensors_embedding.py — companion runtime module for the compressed-tensors FP8 embedding experiment. It is included because the patch bundle is shared with sibling checkpoints, but this 35 B repository does not ship an FP8-embedding branch.
Dockerfile.turboquant — minimal Docker overlay that applies the patcher on top of a vLLM image that already includes PR #39931 plus the upstream FP8 lm_head work (Red Hat / Neural Magic vLLM nightly). Builds in ~30 seconds.

Apply the patch (local dev)

# Clone a vLLM source tree. Stock v0.20.2 is fine for fp8_e4m3 KV-cache
# testing, but does NOT include TurboQuant hybrid attention support. For
# --kv-cache-dtype turboquant_* use vLLM main/nightly from 2026-05-05 or
# later, or a base image with PR #39931.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm.git vllm-src
cd vllm-src

# Place the patcher and companion in /tmp and run the patcher against
# your vLLM install path.
cp /path/to/vllm_patches/apply_ct_fp8_lmhead_patch.py /tmp/
cp /path/to/vllm_patches/compressed_tensors_embedding.py /tmp/

python3 /tmp/apply_ct_fp8_lmhead_patch.py /path/to/site-packages

python3 -m py_compile \
  /path/to/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py \
  /path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py \
  /path/to/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_embedding.py

The patcher is idempotent and detects when PR #41000 has already wired quant_config into ParallelLMHead; in that case it skips the first patch point. For the compressed-tensors FP8 head dispatch itself, it works on both stock v0.20.2 and on Red Hat / Neural Magic vLLM nightlies that already include the upstream lm_head FP8 work. TurboQuant hybrid support is separate and still requires PR #39931 or a newer vLLM build that includes it.
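For readers unfamiliar with the term, an idempotent source patcher is one that can be run repeatedly without changing an already-patched file. The sketch below shows the general marker-check technique on a toy string. It is not the code of apply_ct_fp8_lmhead_patch.py; the marker text and function name are hypothetical.

```python
def apply_once(source, anchor, addition, marker):
    """Insert `addition` after the first line containing `anchor`, unless `marker` is already present.

    Returns (new_source, changed).
    """
    if marker in source:
        return source, False  # already patched: leave the file untouched
    lines = source.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if anchor in line:
            if not line.endswith("\n"):
                lines[index] = line + "\n"
            lines.insert(index + 1, addition if addition.endswith("\n") else addition + "\n")
            return "".join(lines), True
    raise ValueError(f"anchor not found: {anchor!r}")


original = "def get_quant_method(layer):\n    return None\n"
patched, changed = apply_once(
    original,
    anchor="def get_quant_method",
    addition="    # PATCH-MARKER: hypothetical lm_head branch would go here",
    marker="PATCH-MARKER",
)
again, changed_again = apply_once(patched, "def get_quant_method", "    pass", "PATCH-MARKER")
print(changed, changed_again)  # True False
```

Raising when the anchor is missing, rather than silently skipping, makes an incompatible vLLM version fail at build time instead of at serve time.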

Docker overlay (recommended)

# inside vllm_patches/
docker build -f Dockerfile.turboquant -t local/vllm-qwen36-35b-a3b:patched .

The base image referenced in Dockerfile.turboquant already includes the TurboQuant hybrid patch (PR #39931) and the upstream FP8 lm_head work; the overlay only adds the compressed-tensors dispatch.

Inference example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-anything")

resp = client.chat.completions.create(
    model="qwen35b-a3b",
    messages=[
        # Czech prompt: "Write me three sentences about the importance of bees for agriculture."
        {"role": "user", "content": "Napis mi tri vety o vyznamu vcel pro zemedelstvi."},
    ],
    max_tokens=200,
    temperature=0.7,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)

The Qwen3.6 chat template supports both enable_thinking=true and enable_thinking=false. For tests and short prompts prefer false — thinking mode emits a reasoning trace before the final answer and a low max_tokens cap will cut the trace mid-flight, returning content=null.
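A client can guard against the truncated-trace case by checking for a missing `content` together with `finish_reason == "length"`. The helper below is a minimal sketch that works on a plain chat-completions response dictionary; the function name and error messages are assumptions.

```python
def extract_answer(response):
    """Return the final answer text from a chat-completions response dict.

    Raises RuntimeError if the completion ended before any answer was produced.
    """
    choice = response["choices"][0]
    content = choice["message"].get("content")
    if content is None:
        if choice.get("finish_reason") == "length":
            raise RuntimeError(
                "truncated before the final answer; raise max_tokens or disable thinking"
            )
        raise RuntimeError("response has no content")
    return content


ok = {"choices": [{"finish_reason": "stop", "message": {"role": "assistant", "content": "Hello"}}]}
print(extract_answer(ok))  # Hello
```

With the OpenAI Python client, `resp.model_dump()` should yield a dictionary of this shape.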

Quality notes

The 35 B fp8-head branch passed the same short smoke set used for the production branch, including the Czech discount canary that caught the FP8-embedding regression on the 27 B sibling model. That result should not be read as proof that FP8 output projection is universally harmless; it is why the recommended workflow-engine branch remains main with BF16 lm_head and BF16 embed_tokens.

For structured output, tool routing, JSON, enum selection, or long-running agent workflows, prefer main. For summarization, chat, and prose-heavy generation where a small output-projection quantization risk is acceptable, fp8-head gives about 9 % more full-32K concurrency with TurboQuant.

Credits

Qwen team for the base Qwen3.6-35B-A3B checkpoint and the MoE architecture.
Red Hat / Neural Magic LLM Compressor team for the NVFP4 calibration recipe this build is derived from.
vLLM project for the serving runtime, the TurboQuant hybrid attention work (PR #39931), and the upstream FP8 lm_head work.

License

Apache-2.0, inherited from the base Qwen/Qwen3.6-35B-A3B model.

Safetensors

Model size: 21B params
Tensor types: F32, BF16, F8_E4M3
