---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B-FP8
tags:
- qwen3.5
- fp8
- quantization
- vllm
- turboquant
- rtx-5090
- pull-request-artifact
language:
- en
- cs
- zh
library_name: transformers
inference: false
---

# Qwen3.6-27B-FP8 with FP8 `lm_head` and FP8 `embed_tokens`

> 🚀 **RTX 5090 FP8 production candidate (within envelope).** This checkpoint
> keeps the official Qwen3.6-27B-FP8 weight quality path, then quantizes the
> two large vocabulary tables (`lm_head` and
> `model.language_model.embed_tokens`) to FP8. Combined with vLLM hybrid
> TurboQuant ([vllm-project/vllm#39931](https://github.com/vllm-project/vllm/pull/39931)),
> the RTX 5090 16K envelope is quality-validated against BF16 — see
> **Production verdict** below.
>
> This is still a PR-stacked build artifact that stock vLLM cannot load yet. It
> needs two vLLM source overlays:
> [#39931](https://github.com/vllm-project/vllm/pull/39931) (hybrid TurboQuant)
> and the stacked FP8 vocabulary PR
> [#41365](https://github.com/vllm-project/vllm/pull/41365), which carries
> both the FP8 `ParallelLMHead` opt-in (#41000) and the FP8
> `VocabParallelEmbedding` opt-in in a single diff.

## Production verdict (2026-04-30)

This artifact is accepted as the RTX 5090 production candidate for
Qwen3.6-27B reviewer workloads inside the measured envelope:

- prompt budget: ≤ 10,000 tokens
- completion / reasoning budget: ≤ 6,144 generated tokens
- runtime context: 16k
- KV cache: `turboquant_k8v4`
- required vLLM overlays: `vllm-project/vllm#39931` and stacked
  `vllm-project/vllm#41365`

Quality gate summary: `hard_gate_34` passed with 34/34 FP8 rows, zero
transport or parser errors, legal agreement 0.833, harmful precision/recall
1.0/0.75, and wrong-link false positives 0.0. Silver200 fit-subset quality
matched the BF16 reference envelope: 0.8852 agreement vs. BF16 0.88, harmful
precision/recall 0.75/0.75.

Coverage caveat: this is **not** a full BF16 long-context replacement.
Silver200 had 16/200 prompt-cap skips above the 10k prompt budget. Requests
above that budget should route to the RTX 6000 Pro BF16 long-context path.
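
The envelope limits and the routing rule above can be sketched as a small admission check. This is a minimal illustration, not the InferRouter implementation: the `Request` type, the function names, and the backend labels are hypothetical, and it assumes "16k" means 16,384 tokens.

```python
from dataclasses import dataclass

# Limits copied from the measured envelope above.
MAX_PROMPT_TOKENS = 10_000
MAX_COMPLETION_TOKENS = 6_144
RUNTIME_CONTEXT_TOKENS = 16 * 1024  # assumption: "16k" means 16,384 tokens


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    max_completion_tokens: int


def fits_fp8_envelope(req: Request) -> bool:
    """Return True when the request stays inside the validated FP8 envelope."""
    return (
        req.prompt_tokens <= MAX_PROMPT_TOKENS
        and req.max_completion_tokens <= MAX_COMPLETION_TOKENS
        and req.prompt_tokens + req.max_completion_tokens <= RUNTIME_CONTEXT_TOKENS
    )


def choose_backend(req: Request) -> str:
    # Backend labels are hypothetical placeholders, not real service names.
    if fits_fp8_envelope(req):
        return "rtx5090-fp8-16k"
    return "rtx6000pro-bf16-long-context"
```

Checking both budgets separately matters here: a request can fit the 16k context in total and still exceed the 10,000-token prompt budget that the quality gate actually covered.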

Canonical decision record:
`lexhub/docs/decisions/adr-0129-qwen36-27b-fp8-16k-rtx5090-envelope.md`.

## What changed

Derivative of [`Qwen/Qwen3.6-27B-FP8`](https://huggingface.co/Qwen/Qwen3.6-27B-FP8)
plus the existing `lm_head` FP8 artifact
[`inferRouter/Qwen3.6-27B-FP8-lmhead-fp8`](https://huggingface.co/inferRouter/Qwen3.6-27B-FP8-lmhead-fp8).

Two BF16 vocabulary tensors are now block-FP8 E4M3 with BF16 per-block scale
companions:

| Tensor | Upstream dtype/size | This checkpoint |
|---|---:|---:|
| `lm_head.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |
| `model.language_model.embed_tokens.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |

Added companion tensors:

```text
lm_head.weight_scale_inv                           BF16 [1940, 40]
model.language_model.embed_tokens.weight_scale_inv BF16 [1940, 40]
```

Expected net model-load VRAM reduction vs the original Qwen FP8 checkpoint is
about **2.37 GiB** total: ~1.184 GiB each from `lm_head` and `embed_tokens`,
less a negligible amount for the two scale tensors.
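
The sizes in the table can be reproduced from the scale-tensor shapes with a few lines of arithmetic. This sketch assumes 128x128 quantization blocks, which is not stated above but is the block size at which the `[1940, 40]` scale shape matches the listed tensor sizes; the vocabulary and hidden sizes are inferred from that assumption rather than read from the config.

```python
GIB = 1024 ** 3
BLOCK = 128  # assumption: 128x128 quantization blocks

# Shape of each weight_scale_inv tensor, from the listing above.
scale_rows, scale_cols = 1940, 40

# Inferred table dimensions (assumption: dimensions are exact block multiples).
vocab_size = scale_rows * BLOCK   # 248,320
hidden_size = scale_cols * BLOCK  # 5,120

bf16_bytes = vocab_size * hidden_size * 2    # 2 bytes per BF16 element
fp8_bytes = vocab_size * hidden_size * 1     # 1 byte per FP8 element
scale_bytes = scale_rows * scale_cols * 2    # BF16 scale per block

saved_per_tensor = bf16_bytes - fp8_bytes - scale_bytes
total_saved = 2 * saved_per_tensor           # lm_head + embed_tokens

print(f"BF16 table:  {bf16_bytes / GIB:.3f} GiB")
print(f"FP8 table:   {fp8_bytes / GIB:.3f} GiB")
print(f"Scales:      {scale_bytes / 1024:.1f} KiB per tensor")
print(f"Net saving:  {total_saved / GIB:.3f} GiB")
```

The scale tensors cost on the order of 150 KiB each, so they barely dent the saving from halving the two tables.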

## Local validation

Validated locally on 2026-04-30 on a single RTX 5090 32 GB, TP=1, with the
InferRouter vLLM 0.20 overlay image plus the local FP8 embedding patch.

### Smoke profile (initial fit verification)

```text
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.96
--max-model-len 4029
--max-num-seqs 4
--max-num-batched-tokens 10500
--max-cudagraph-capture-size 4
--block-size 16
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--enable-chunked-prefill
```

Observed startup:

```text
Model loading took 26.19 GiB memory
GPU KV cache size: 16,640 tokens
Maximum concurrency for 4,029 tokens per request: 7.00x
post-startup GPU free memory: ~2.59 GiB
```

Operational smoke:

- deterministic short prompt: PASS (`2+2 je 4.`)
- 4 concurrent short Czech prompts: PASS, no OOM/crash

### Production ceiling profile (builder pin)

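This is the configuration the InferRouter builder pins as the recommended
production envelope for the RTX 5090 32 GB fleet. Relative to the smoke
profile, it raises `--gpu-memory-utilization` from 0.96 to 0.98 and
`--max-num-seqs` from 4 to 8, with matching increases to
`--max-num-batched-tokens` and `--max-cudagraph-capture-size`, while staying
within the validated fit envelope.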

```text
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.98
--max-model-len 4029
--max-num-seqs 8
--max-num-batched-tokens 12288
--max-cudagraph-capture-size 8
--enable-chunked-prefill
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
```

(`--enforce-eager` is OFF; CUDA graphs are captured up to the
`--max-cudagraph-capture-size` ceiling above.)

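This local validation covers fit and operational smoke only; on its own it
does **not** establish quality. The checkpoint was published so the production
builder can create a reproducible image and run the real side-by-side eval
against `Qwen/Qwen3.6-27B-FP8` and
[`inferRouter/Qwen3.6-27B-FP8-lmhead-fp8`](https://huggingface.co/inferRouter/Qwen3.6-27B-FP8-lmhead-fp8).
Quality acceptance for the 16k envelope is recorded under
**Production verdict** above.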

## Required vLLM overlays

This checkpoint is not loadable by stock vLLM 0.20.

Apply exactly **two** PR overlays, in this order, on top of `vllm/vllm-openai:v0.20.0`:

1. [#39931](https://github.com/vllm-project/vllm/pull/39931) — hybrid TurboQuant
   support for Qwen3.6/GDN hybrid models.
2. [#41365](https://github.com/vllm-project/vllm/pull/41365) — stacked FP8
   vocabulary PR. Its `pull/41365.diff` already contains the
   [#41000](https://github.com/vllm-project/vllm/pull/41000) FP8
   `ParallelLMHead` hunks plus the new FP8 `VocabParallelEmbedding` /
   `embed_tokens` hunks. Do **not** also list `#41000` separately — applying
   `#41000` first and then `#41365` will fail in `patch --forward --batch`
   because the `#41000` hunks are already present in the `#41365` diff
   (this is the correct fail-fast behaviour, not silent overwrite).
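
The ordering rule above can be checked before any patch is applied. The following is a hypothetical pre-flight helper, not part of the InferRouter builder: the function name and error messages are illustrative, and the `.diff` URL pattern is assumed from the `pull/41365.diff` reference above.

```python
# PR numbers come from the list above; the helper itself is hypothetical.
REQUIRED_OVERLAYS = (39931, 41365)
SUBSUMED_BY = {41000: 41365}  # the #41000 hunks already ship inside the #41365 diff


def overlay_diff_urls(pr_numbers):
    """Return the .diff URLs in apply order, or raise on a bad overlay list."""
    for pr in pr_numbers:
        parent = SUBSUMED_BY.get(pr)
        if parent is not None and parent in pr_numbers:
            raise ValueError(f"#{pr} is already contained in #{parent}; drop #{pr}")
    if tuple(pr_numbers) != REQUIRED_OVERLAYS:
        raise ValueError(f"expected overlays {REQUIRED_OVERLAYS}, got {tuple(pr_numbers)}")
    return [f"https://github.com/vllm-project/vllm/pull/{pr}.diff" for pr in pr_numbers]
```

Rejecting the list up front gives a clearer error than waiting for `patch --forward --batch` to fail on duplicate hunks, although that failure remains the real safeguard.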

The checkpoint uses explicit config opt-ins:

```json
{
  "quantization_config": {
    "lm_head": true,
    "embed_tokens": true,
    "embeddings": true
  }
}
```

Both opt-ins (`lm_head` and `embed_tokens`) are required at the config layer
even though the engine-side patch is unified — they enable distinct dispatcher
branches inside `Fp8Config.get_quant_method`.
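
Because a missing opt-in only surfaces once the engine loads the weights, it can help to check the config before building an image. This is a minimal sketch that assumes the keys sit under `quantization_config` exactly as shown above; the helper name is hypothetical.

```python
import json

# The two opt-ins the text above names as required.
REQUIRED_OPT_INS = ("lm_head", "embed_tokens")


def missing_opt_ins(config_text: str) -> list:
    """Return the required opt-in keys that are absent or not literally true."""
    quant = json.loads(config_text).get("quantization_config", {})
    return [key for key in REQUIRED_OPT_INS if quant.get(key) is not True]


example = '{"quantization_config": {"lm_head": true, "embed_tokens": true, "embeddings": true}}'
print(missing_opt_ins(example))
```

In practice `config_text` would be the contents of the checkpoint's `config.json`; an empty result means both opt-ins are present.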

## Notes

The FP8 embedding path in #41365 is intentionally memory-first: gathered rows
are dequantized on demand. It proves load/runtime correctness and VRAM savings;
a future fused embedding kernel would be the performance path.
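
The gather-then-dequantize idea can be illustrated in pure Python. This is a conceptual sketch, not the code in #41365: plain floats stand in for FP8 values, the block size is shrunk to 2 for readability, and it assumes `weight_scale_inv` is the per-block multiplier applied at dequantization time.

```python
def gather_and_dequantize(token_ids, weight_q, scale_inv, block=2):
    """Dequantize only the embedding rows selected by token_ids.

    weight_q:  [vocab][hidden] quantized values (floats stand in for FP8 here)
    scale_inv: [vocab // block][hidden // block] per-block multipliers
    """
    rows = []
    for tok in token_ids:
        row_scales = scale_inv[tok // block]  # scales for this row's block row
        rows.append([
            q * row_scales[col // block]      # one multiplier per column block
            for col, q in enumerate(weight_q[tok])
        ])
    return rows


# Toy 4x4 table with 2x2 blocks.
weight_q = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
scale_inv = [
    [0.5, 2.0],
    [1.0, 0.25],
]
print(gather_and_dequantize([2, 0], weight_q, scale_inv))
```

Only the looked-up rows are ever expanded, so the full table stays in its compact form in memory; the cost is the extra per-lookup multiply that a fused kernel would absorb.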

## License

Inherited from upstream Qwen / Apache-2.0. The weights are deterministic
mathematical derivatives of the upstream FP8 checkpoint.