---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B-FP8
tags:
- qwen3.5
- fp8
- quantization
- vllm
- turboquant
- rtx-5090
- pull-request-artifact
language:
- en
- cs
- zh
library_name: transformers
inference: false
---

# Qwen3.6-27B-FP8 with FP8 `lm_head` and FP8 `embed_tokens`

> 🚀 **RTX 5090 FP8 production candidate (within envelope).** This checkpoint
> keeps the official Qwen3.6-27B-FP8 weight quality path, then quantizes the
> two large vocabulary tables (`lm_head` and
> `model.language_model.embed_tokens`) to FP8. Combined with vLLM hybrid
> TurboQuant ([vllm-project/vllm#39931](https://github.com/vllm-project/vllm/pull/39931)),
> the RTX 5090 16K envelope is quality-validated against BF16; see
> **Production verdict** below.
>
> This is still a build/PR-stacked artefact, not a stock-vLLM model yet. It
> needs two vLLM source overlays:
> [#39931](https://github.com/vllm-project/vllm/pull/39931) (hybrid TurboQuant)
> and the stacked FP8 vocabulary PR
> [#41365](https://github.com/vllm-project/vllm/pull/41365), which carries
> both the FP8 `ParallelLMHead` opt-in (#41000) and the FP8
> `VocabParallelEmbedding` opt-in in a single diff.

## Production verdict (2026-04-30)

This artefact is accepted as the RTX 5090 production candidate for
Qwen3.6-27B reviewer workloads inside the measured envelope:

- prompt budget: ≤ 10,000 tokens
- completion / reasoning budget: ≤ 6,144 generated tokens
- runtime context: 16k
- KV cache: `turboquant_k8v4`
- required vLLM overlays: `vllm-project/vllm#39931` and stacked `vllm-project/vllm#41365`

Quality gate summary: `hard_gate_34` passed with 34/34 FP8 rows, zero
transport or parser errors, legal agreement 0.833, harmful precision/recall
1.0/0.75, and wrong-link false positives 0.0. Silver200 fit-subset quality
matched the BF16 reference envelope: 0.8852 agreement vs. BF16 0.88, harmful
precision/recall 0.75/0.75.

Coverage caveat: this is **not** a full BF16 long-context replacement.
Silver200 had 16/200 prompt-cap skips above the 10k prompt budget. Requests
above that budget should route to the RTX 6000 Pro BF16 long-context path.

Canonical decision record:
`lexhub/docs/decisions/adr-0129-qwen36-27b-fp8-16k-rtx5090-envelope.md`.

## What changed

Derivative of [`Qwen/Qwen3.6-27B-FP8`](https://huggingface.co/Qwen/Qwen3.6-27B-FP8)
plus the existing `lm_head` FP8 artifact
[`inferRouter/Qwen3.6-27B-FP8-lmhead-fp8`](https://huggingface.co/inferRouter/Qwen3.6-27B-FP8-lmhead-fp8).
Two BF16 vocabulary tensors are now block-FP8 E4M3 with BF16 per-block scale
companions:

| Tensor | Upstream dtype/size | This checkpoint |
|---|---:|---:|
| `lm_head.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |
| `model.language_model.embed_tokens.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |

Added companion tensors:

```text
lm_head.weight_scale_inv                              BF16 [1940, 40]
model.language_model.embed_tokens.weight_scale_inv    BF16 [1940, 40]
```

Expected net model-load VRAM reduction vs the original Qwen FP8 checkpoint is
about **2.36 GiB** total: ~1.18 GiB from `lm_head` and ~1.18 GiB from
`embed_tokens`.

## Local validation

Validated locally on 2026-04-30 on a single RTX 5090 32 GB, TP=1, with the
InferRouter vLLM 0.20 overlay image plus the local FP8 embedding patch.

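
As a quick sanity check before launching, the tensor sizes listed under "What changed" can be re-derived from the `[1940, 40]` scale shape. The sketch below is only illustrative arithmetic: it assumes a 128×128 quantization block (inferred from the scale shape, not stated in this card), and the helper names `table_bytes` and `net_saving_gib` are hypothetical.

```python
GIB = 1024 ** 3
BLOCK = 128  # ASSUMPTION: 128x128 FP8 blocks, inferred from the [1940, 40] scale shape


def table_bytes(scale_shape, block=BLOCK):
    """Return (bf16_bytes, fp8_bytes, scale_bytes) for one vocabulary table."""
    block_rows, block_cols = scale_shape
    elements = (block_rows * block) * (block_cols * block)
    bf16_bytes = elements * 2                    # 2 bytes per BF16 weight
    fp8_bytes = elements * 1                     # 1 byte per FP8 E4M3 weight
    scale_bytes = block_rows * block_cols * 2    # BF16 scale per block
    return bf16_bytes, fp8_bytes, scale_bytes


def net_saving_gib(scale_shape, tables=2):
    """Net VRAM saved across `tables` identical tables, scales included."""
    bf16_bytes, fp8_bytes, scale_bytes = table_bytes(scale_shape)
    return tables * (bf16_bytes - fp8_bytes - scale_bytes) / GIB


if __name__ == "__main__":
    bf16_b, fp8_b, scale_b = table_bytes((1940, 40))
    print(f"BF16 table:  {bf16_b / GIB:.3f} GiB")
    print(f"FP8 table:   {fp8_b / GIB:.3f} GiB")
    print(f"scale table: {scale_b / 1024:.1f} KiB")
    print(f"net saving for lm_head + embed_tokens: {net_saving_gib((1940, 40)):.3f} GiB")
```

Under that block-size assumption, the scale companions cost only about 152 KiB each, so the net saving is essentially the 1.184 GiB per table shown in the table above.
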
### Smoke profile (initial fit verification)

```text
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.96
--max-model-len 4029
--max-num-seqs 4
--max-num-batched-tokens 10500
--max-cudagraph-capture-size 4
--block-size 16
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--enable-chunked-prefill
```

Observed startup:

```text
Model loading took 26.19 GiB memory
GPU KV cache size: 16,640 tokens
Maximum concurrency for 4,029 tokens per request: 7.00x
post-startup GPU free memory: ~2.59 GiB
```

Operational smoke:

- deterministic short prompt: PASS (`2+2 je 4.`)
- 4 concurrent short Czech prompts: PASS, no OOM/crash

### Production ceiling profile (builder pin)

This is the configuration the InferRouter builder pins as the recommended
production envelope for the RTX 5090 32 GB fleet. It pushes mem-util and
concurrency beyond the smoke profile while staying within the validated fit
envelope.

```text
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.98
--max-model-len 4029
--max-num-seqs 8
--max-num-batched-tokens 12288
--max-cudagraph-capture-size 8
--enable-chunked-prefill
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
```

(`--enforce-eager` is OFF; CUDA graphs are captured up to the
`--max-cudagraph-capture-size` ceiling above.)

Quality has **not** been accepted yet. This checkpoint exists so the
production builder can create a reproducible image and run the real
side-by-side eval against `Qwen/Qwen3.6-27B-FP8` and
[`inferRouter/Qwen3.6-27B-FP8-lmhead-fp8`](https://huggingface.co/inferRouter/Qwen3.6-27B-FP8-lmhead-fp8).

## Required vLLM overlays

This checkpoint is not loadable by stock vLLM 0.20. Apply exactly **two** PR
overlays, in this order, on top of `vllm/vllm-openai:v0.20.0`:

1. [#39931](https://github.com/vllm-project/vllm/pull/39931): hybrid
   TurboQuant support for Qwen3.6/GDN hybrid models.
2. [#41365](https://github.com/vllm-project/vllm/pull/41365): stacked FP8
   vocabulary PR. Its `pull/41365.diff` already contains the
   [#41000](https://github.com/vllm-project/vllm/pull/41000) FP8
   `ParallelLMHead` hunks plus the new FP8 `VocabParallelEmbedding` /
   `embed_tokens` hunks.

Do **not** also list `#41000` separately: applying `#41000` first and then
`#41365` will fail in `patch --forward --batch` because the `#41000` hunks are
already present in the `#41365` diff (this is the correct fail-fast behaviour,
not silent overwrite).

The checkpoint uses explicit config opt-ins:

```json
{
  "quantization_config": {
    "lm_head": true,
    "embed_tokens": true,
    "embeddings": true
  }
}
```

Both opt-ins (`lm_head` and `embed_tokens`) are required at the config layer
even though the engine-side patch is unified; they enable distinct dispatcher
branches inside `Fp8Config.get_quant_method`.

## Notes

The FP8 embedding path in #41365 is intentionally memory-first: gathered rows
are dequantized on demand. It proves load/runtime correctness and VRAM
savings; a future fused embedding kernel would be the performance path.

## License

Inherited from upstream Qwen / Apache-2.0. The weights are deterministic
mathematical derivatives of the upstream FP8 checkpoint.
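
## Appendix: dequantize-on-demand sketch

To make the "memory-first" idea from the Notes concrete, here is a minimal pure-Python sketch of gathering embedding rows and dequantizing only those rows. It is not the code in #41365. The function name `gather_dequant` is hypothetical, plain floats stand in for FP8 values, and it assumes the common block-FP8 convention that a dequantized weight equals the stored value multiplied by its block's `weight_scale_inv` entry.

```python
def gather_dequant(q_weight, scale_inv, token_ids, block=128):
    """Look up embedding rows and dequantize only the rows that were gathered.

    q_weight:  [vocab][hidden] quantized values (plain floats stand in for FP8)
    scale_inv: [vocab / block][hidden / block] per-block multipliers
    ASSUMPTION: dequantized = quantized * scale_inv[row // block][col // block]
    """
    rows = []
    for tok in token_ids:
        block_row = scale_inv[tok // block]  # scales for this row's block row
        rows.append(
            [q * block_row[col // block] for col, q in enumerate(q_weight[tok])]
        )
    return rows


if __name__ == "__main__":
    # Toy table: vocab 4, hidden 4, block edge 2, so a 2x2 grid of scales.
    q = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]
    s = [[0.5, 2.0], [3.0, 4.0]]
    print(gather_dequant(q, s, [0, 3], block=2))
```

The full table stays in its quantized form; only the rows requested by `token_ids` are expanded, which is where the VRAM saving comes from. A fused kernel would do the same gather and scale in one pass on the GPU.
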