Qwen3.6-27B-FP8 with FP8 lm_head and FP8 embed_tokens

🚀 RTX 5090 FP8 production candidate (within the measured envelope). This checkpoint keeps the official Qwen3.6-27B-FP8 weight quality path, then quantizes the two large vocabulary tables (lm_head and model.language_model.embed_tokens) to FP8. Combined with vLLM hybrid TurboQuant (vllm-project/vllm#39931), the RTX 5090 16K envelope is quality-validated against BF16; see Production verdict below.

This is still a build/PR-stacked artefact, not a stock-vLLM model yet. It needs two vLLM source overlays: #39931 (hybrid TurboQuant) and the stacked FP8 vocabulary PR #41365, which carries both the FP8 ParallelLMHead opt-in (#41000) and the FP8 VocabParallelEmbedding opt-in in a single diff.

Production verdict (2026-04-30)

This artefact is accepted as the RTX 5090 production candidate for Qwen3.6-27B reviewer workloads inside the measured envelope:

  • prompt budget: ≤ 10,000 tokens
  • completion / reasoning budget: ≤ 6,144 generated tokens
  • runtime context: 16k
  • KV cache: turboquant_k8v4
  • required vLLM overlays: vllm-project/vllm#39931 and stacked vllm-project/vllm#41365

Quality gate summary: hard_gate_34 passed with 34/34 FP8 rows, zero transport or parser errors, legal agreement 0.833, harmful precision/recall 1.0/0.75, and wrong-link false positives 0.0. Silver200 fit-subset quality matched the BF16 reference envelope: 0.8852 agreement vs. BF16 0.88, harmful precision/recall 0.75/0.75.

Coverage caveat: this is not a full BF16 long-context replacement. Silver200 had 16/200 prompt-cap skips above the 10k prompt budget. Requests above that budget should route to the RTX 6000 Pro BF16 long-context path.
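
The envelope limits and the routing rule above can be expressed as a small admission check. This is a minimal sketch, not the InferRouter implementation: the route labels and the Request type are hypothetical, and sending over-budget completions to the BF16 path is an assumption (the card only states the rule for over-budget prompts).

```python
from dataclasses import dataclass

# Limits copied from the measured envelope in this card.
MAX_PROMPT_TOKENS = 10_000
MAX_COMPLETION_TOKENS = 6_144

# Hypothetical route labels, used only for illustration.
ROUTE_FP8_5090 = "rtx5090-fp8-16k"
ROUTE_BF16_LONG = "rtx6000pro-bf16-long-context"


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    max_completion_tokens: int


def pick_route(req: Request) -> str:
    """Return the route for a request based on the validated envelope."""
    fits = (
        req.prompt_tokens <= MAX_PROMPT_TOKENS
        # Assumption: an over-budget completion is also routed away.
        and req.max_completion_tokens <= MAX_COMPLETION_TOKENS
    )
    return ROUTE_FP8_5090 if fits else ROUTE_BF16_LONG


if __name__ == "__main__":
    print(pick_route(Request(prompt_tokens=8_000, max_completion_tokens=4_096)))
    print(pick_route(Request(prompt_tokens=12_500, max_completion_tokens=2_048)))
```

Both limits are inclusive, so a request with exactly 10,000 prompt tokens and 6,144 completion tokens stays on the RTX 5090 path.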

Canonical decision record: lexhub/docs/decisions/adr-0129-qwen36-27b-fp8-16k-rtx5090-envelope.md.

What changed

This checkpoint is a derivative of Qwen/Qwen3.6-27B-FP8 plus the existing lm_head FP8 artefact inferRouter/Qwen3.6-27B-FP8-lmhead-fp8.

Two BF16 vocabulary tensors are now block-FP8 E4M3 with BF16 per-block scale companions:

Tensor                                     Upstream dtype/size   This checkpoint
lm_head.weight                             BF16, 2.368 GiB       F8_E4M3, 1.184 GiB
model.language_model.embed_tokens.weight   BF16, 2.368 GiB       F8_E4M3, 1.184 GiB

Added companion tensors:

lm_head.weight_scale_inv                           BF16 [1940, 40]
model.language_model.embed_tokens.weight_scale_inv BF16 [1940, 40]

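Expected net model-load VRAM reduction vs the original Qwen FP8 checkpoint is about 2.37 GiB total: ~1.18 GiB from lm_head and ~1.18 GiB from embed_tokens. The added scale tensors are negligible by comparison (1940 × 40 BF16 values each, roughly 152 KiB per table).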
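
The sizes above can be re-derived from the scale-tensor shape. The sketch below assumes 128×128 quantization blocks, which this card does not state; under that assumption the [1940, 40] scale shape implies a 248,320 × 5,120 vocabulary table, which matches the 2.368 GiB BF16 figure in the table.

```python
# Assumption: 128x128 quantization blocks (not stated in this card).
BLOCK = 128
GIB = 1024 ** 3

SCALE_SHAPE = (1940, 40)  # weight_scale_inv shape listed in this card


def table_shape(scale_shape, block=BLOCK):
    """Infer the weight shape, assuming both dims are exact block multiples."""
    rows, cols = scale_shape
    return rows * block, cols * block


def gib(n_bytes):
    return n_bytes / GIB


rows, cols = table_shape(SCALE_SHAPE)
bf16_bytes = rows * cols * 2                        # 2 bytes per BF16 value
fp8_bytes = rows * cols * 1                         # 1 byte per F8_E4M3 value
scale_bytes = SCALE_SHAPE[0] * SCALE_SHAPE[1] * 2   # BF16 scale companions
saved_per_table = bf16_bytes - fp8_bytes - scale_bytes

if __name__ == "__main__":
    print(f"table shape:        {rows} x {cols}")
    print(f"BF16 size:          {gib(bf16_bytes):.3f} GiB")
    print(f"FP8 size:           {gib(fp8_bytes):.3f} GiB")
    print(f"scale tensor:       {scale_bytes / 1024:.1f} KiB")
    print(f"net saved (2 tbls): {gib(2 * saved_per_table):.3f} GiB")
```

The scale tensors cost about 152 KiB per table, so they do not change the saving at the precision quoted here.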

Local validation

Validated locally on 2026-04-30 on a single RTX 5090 32 GB, TP=1, with the InferRouter vLLM 0.20 overlay image plus the local FP8 embedding patch.

Smoke profile (initial fit verification)

--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.96
--max-model-len 4029
--max-num-seqs 4
--max-num-batched-tokens 10500
--max-cudagraph-capture-size 4
--block-size 16
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--enable-chunked-prefill

Observed startup:

Model loading took 26.19 GiB memory
GPU KV cache size: 16,640 tokens
Maximum concurrency for 4,029 tokens per request: 7.00x
post-startup GPU free memory: ~2.59 GiB

Operational smoke:

  • deterministic short prompt: PASS (model output: "2+2 je 4.", Czech for "2+2 is 4.")
  • 4 concurrent short Czech prompts: PASS, no OOM/crash

Production ceiling profile (builder pin)

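This is the configuration the InferRouter builder pins as the recommended production envelope for the RTX 5090 32 GB fleet. Relative to the smoke profile, it raises --gpu-memory-utilization from 0.96 to 0.98, --max-num-seqs from 4 to 8, --max-num-batched-tokens from 10500 to 12288, and --max-cudagraph-capture-size from 4 to 8, while staying within the validated fit envelope.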

--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.98
--max-model-len 4029
--max-num-seqs 8
--max-num-batched-tokens 12288
--max-cudagraph-capture-size 8
--enable-chunked-prefill
--reasoning-parser qwen3
--tool-call-parser qwen3_coder

(--enforce-eager is OFF; CUDA graphs are captured up to the --max-cudagraph-capture-size ceiling above.)
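
For reproducibility it can help to keep the pinned flags in one place and render the launch command from them. The sketch below assumes the server is started with the `vllm serve` entrypoint inside the overlay image; the card lists only the flags, so the entrypoint and the way the builder actually launches the server are assumptions.

```python
import shlex

MODEL_ID = "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8"

# Flags copied from the production ceiling profile; None marks a bare switch.
PRODUCTION_FLAGS = {
    "--quantization": "fp8",
    "--language-model-only": None,
    "--kv-cache-dtype": "turboquant_k8v4",
    "--gpu-memory-utilization": "0.98",
    "--max-model-len": "4029",
    "--max-num-seqs": "8",
    "--max-num-batched-tokens": "12288",
    "--max-cudagraph-capture-size": "8",
    "--enable-chunked-prefill": None,
    "--reasoning-parser": "qwen3",
    "--tool-call-parser": "qwen3_coder",
}


def build_argv(model_id, flags):
    """Render an argv list; assumes the `vllm serve` entrypoint is used."""
    argv = ["vllm", "serve", model_id]
    for flag, value in flags.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


if __name__ == "__main__":
    print(shlex.join(build_argv(MODEL_ID, PRODUCTION_FLAGS)))
```

Several of these flags exist only with the overlays applied, so the rendered command is expected to fail on stock vLLM.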

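These local runs verify fit and operational stability only and are not a quality gate; quality acceptance is recorded separately under Production verdict. This checkpoint exists so the production builder can create a reproducible image and run the real side-by-side eval against Qwen/Qwen3.6-27B-FP8 and inferRouter/Qwen3.6-27B-FP8-lmhead-fp8.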

Required vLLM overlays

This checkpoint is not loadable by stock vLLM 0.20.

Apply exactly two PR overlays, in this order, on top of vllm/vllm-openai:v0.20.0:

  1. #39931 — hybrid TurboQuant support for Qwen3.6/GDN hybrid models.
  2. #41365 — stacked FP8 vocabulary PR. Its pull/41365.diff already contains the #41000 FP8 ParallelLMHead hunks plus the new FP8 VocabParallelEmbedding / embed_tokens hunks. Do not also list #41000 separately — applying #41000 first and then #41365 will fail in patch --forward --batch because the #41000 hunks are already present in the #41365 diff (this is the correct fail-fast behaviour, not silent overwrite).
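
The overlay order can be made explicit in a build script. The sketch below only generates the commands and does not run them. The source directory and the temporary diff paths are hypothetical, and the `.diff` URLs use GitHub's standard pull-request diff endpoint; where the vLLM sources live inside the image is not stated in this card.

```python
# Overlay PRs in the required order. #41000 is deliberately absent because
# its hunks are already contained in the #41365 diff.
OVERLAY_PRS = [39931, 41365]


def diff_url(pr):
    return f"https://github.com/vllm-project/vllm/pull/{pr}.diff"


def overlay_commands(prs, src_dir="/opt/vllm-src"):  # src_dir is hypothetical
    """Return download and patch commands (as argv lists) for each PR."""
    if 41000 in prs and 41365 in prs:
        raise ValueError("#41000 is already included in #41365; list only #41365")
    commands = []
    for pr in prs:
        diff_path = f"/tmp/pr-{pr}.diff"  # absolute, since patch -d changes cwd
        commands.append(["curl", "-fsSL", diff_url(pr), "-o", diff_path])
        # --forward --batch makes an already-applied hunk fail without prompting.
        commands.append(
            ["patch", "-p1", "--forward", "--batch", "-d", src_dir, "-i", diff_path]
        )
    return commands


if __name__ == "__main__":
    for cmd in overlay_commands(OVERLAY_PRS):
        print(" ".join(cmd))
```

Rejecting a list that contains both #41000 and #41365 before anything is downloaded mirrors the fail-fast behaviour described above, just earlier in the build.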

The checkpoint uses explicit config opt-ins:

{
  "quantization_config": {
    "lm_head": true,
    "embed_tokens": true,
    "embeddings": true
  }
}

Both opt-ins (lm_head and embed_tokens) are required at the config layer even though the engine-side patch is unified — they enable distinct dispatcher branches inside Fp8Config.get_quant_method.
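
A pre-flight check on the checkpoint's config.json can catch a missing opt-in before the engine starts. This is a minimal sketch: it checks only the two keys named as required above, and it leaves the "embeddings" key from the fragment unchecked because the card does not describe its role.

```python
import json

# The two opt-ins this card names as required.
REQUIRED_OPT_INS = ("lm_head", "embed_tokens")


def missing_opt_ins(config):
    """Return the required opt-in keys that are absent or not literally true."""
    qcfg = config.get("quantization_config", {})
    return [key for key in REQUIRED_OPT_INS if qcfg.get(key) is not True]


if __name__ == "__main__":
    # Fragment copied from this card; a real config.json has more fields.
    sample = json.loads(
        '{"quantization_config":'
        ' {"lm_head": true, "embed_tokens": true, "embeddings": true}}'
    )
    print(missing_opt_ins(sample))
    print(missing_opt_ins({"quantization_config": {"lm_head": True}}))
```

In practice the dictionary would come from `json.load` on the checkpoint's config.json rather than from an inline string.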

Notes

The FP8 embedding path in #41365 is intentionally memory-first: gathered rows are dequantized on demand. It proves load/runtime correctness and VRAM savings; a future fused embedding kernel would be the performance path.
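
The memory-first idea can be illustrated with a toy gather-then-dequantize lookup: only the rows for the requested token ids are expanded, so the full table never exists in high precision. This is an illustrative sketch in plain Python, not the code from #41365. It assumes the common block-FP8 convention that a weight is recovered as the stored value multiplied by its block's weight_scale_inv entry, and it uses small integers in place of real FP8 codes.

```python
def dequant_rows(q, scale_inv, token_ids, block):
    """Gather rows of q by token id and dequantize only those rows.

    q:         quantized table, one list of values per vocabulary row
    scale_inv: per-block scales, indexed [row // block][col // block]
    Assumption: dequantized value = stored value * scale_inv of its block.
    """
    out = []
    for tok in token_ids:
        row_scales = scale_inv[tok // block]
        out.append(
            [value * row_scales[col // block] for col, value in enumerate(q[tok])]
        )
    return out


if __name__ == "__main__":
    # Toy 4x4 table with 2x2 blocks; integers stand in for FP8 codes.
    q = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [2, 2, 2, 2],
        [-1, 0, 2, -2],
    ]
    scale_inv = [
        [0.5, 2.0],  # scales for rows 0-1
        [1.0, 4.0],  # scales for rows 2-3
    ]
    print(dequant_rows(q, scale_inv, token_ids=[0, 3], block=2))
```

The per-lookup cost is one multiply per gathered element, which is the overhead a fused embedding kernel would aim to remove.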

License

Inherited from upstream Qwen / Apache-2.0. The weights are deterministic mathematical derivatives of the upstream FP8 checkpoint.
