Qwen3.6-27B-FP8 with FP8 lm_head and FP8 embed_tokens
🚀 RTX 5090 FP8 production candidate (within envelope). This checkpoint keeps the official Qwen3.6-27B-FP8 weight quality path, then quantizes the two large vocabulary tables (`lm_head` and `model.language_model.embed_tokens`) to FP8. Combined with vLLM hybrid TurboQuant (vllm-project/vllm#39931), the RTX 5090 16K envelope is quality-validated against BF16; see Production verdict below.

This is still a build/PR-stacked artefact, not a stock-vLLM model yet. It needs two vLLM source overlays: #39931 (hybrid TurboQuant) and the stacked FP8 vocabulary PR #41365, which carries both the FP8 `ParallelLMHead` opt-in (#41000) and the FP8 `VocabParallelEmbedding` opt-in in a single diff.
Production verdict (2026-04-30)
This artefact is accepted as the RTX 5090 production candidate for Qwen3.6-27B reviewer workloads inside the measured envelope:
- prompt budget: ≤ 10,000 tokens
- completion / reasoning budget: ≤ 6,144 generated tokens
- runtime context: 16k
- KV cache: `turboquant_k8v4`
- required vLLM overlays: `vllm-project/vllm#39931` and stacked `vllm-project/vllm#41365`
Quality gate summary: hard_gate_34 passed with 34/34 FP8 rows, zero
transport or parser errors, legal agreement 0.833, harmful precision/recall
1.0/0.75, and wrong-link false positives 0.0. Silver200 fit-subset quality
matched the BF16 reference envelope: 0.8852 agreement vs. BF16 0.88, harmful
precision/recall 0.75/0.75.
Coverage caveat: this is not a full BF16 long-context replacement. Silver200 had 16/200 prompt-cap skips above the 10k prompt budget. Requests above that budget should route to the RTX 6000 Pro BF16 long-context path.
Canonical decision record:
lexhub/docs/decisions/adr-0129-qwen36-27b-fp8-16k-rtx5090-envelope.md.
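The envelope limits and the routing rule from the coverage caveat can be written as a small admission check. This is a minimal sketch, not the InferRouter implementation: the function name and the backend labels are hypothetical, "16k" is read as 16,384 tokens, and sending over-budget completions to the BF16 path is an assumption (the card only states the rule for over-budget prompts).

```python
# Limits taken from the production verdict above.
MAX_PROMPT_TOKENS = 10_000
MAX_COMPLETION_TOKENS = 6_144
RUNTIME_CONTEXT = 16 * 1024  # assumption: "16k" means 16,384 tokens

# Hypothetical backend labels, used only for illustration.
FP8_5090 = "rtx5090-fp8"
BF16_6000 = "rtx6000pro-bf16"


def pick_backend(prompt_tokens: int, max_completion_tokens: int) -> str:
    """Return the FP8 backend only when the request fits the measured envelope."""
    fits_prompt = prompt_tokens <= MAX_PROMPT_TOKENS
    fits_completion = max_completion_tokens <= MAX_COMPLETION_TOKENS
    fits_context = prompt_tokens + max_completion_tokens <= RUNTIME_CONTEXT
    if fits_prompt and fits_completion and fits_context:
        return FP8_5090
    # Anything outside the validated envelope goes to the long-context path.
    return BF16_6000


print(pick_backend(8_000, 4_096))
print(pick_backend(12_500, 1_024))
```

The largest in-envelope request (10,000 prompt tokens plus 6,144 generated tokens) totals 16,144 tokens, which stays under the assumed 16,384-token context.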
What changed
Derivative of Qwen/Qwen3.6-27B-FP8
plus the existing lm_head FP8 artifact
inferRouter/Qwen3.6-27B-FP8-lmhead-fp8.
Two BF16 vocabulary tensors are now block-FP8 E4M3 with BF16 per-block scale companions:
| Tensor | Upstream dtype/size | This checkpoint |
|---|---|---|
| `lm_head.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |
| `model.language_model.embed_tokens.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |
Added companion tensors:
lm_head.weight_scale_inv BF16 [1940, 40]
model.language_model.embed_tokens.weight_scale_inv BF16 [1940, 40]
Expected net model-load VRAM reduction vs the original Qwen FP8 checkpoint is
about 2.36 GiB total: ~1.18 GiB from lm_head and ~1.18 GiB from
embed_tokens.
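The size figures above can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption that the card does not state: the scales cover 128×128 blocks and the tensor dimensions are exact multiples of that block size. The inferred weight shape is therefore a derived value, not one read from the checkpoint.

```python
GIB = 1024 ** 3
BLOCK = 128  # assumed block edge; not stated in this card

scale_shape = (1940, 40)  # weight_scale_inv shape listed above
rows = scale_shape[0] * BLOCK  # inferred vocabulary rows
cols = scale_shape[1] * BLOCK  # inferred hidden size

bf16_bytes = rows * cols * 2  # 2 bytes per BF16 element
fp8_bytes = rows * cols * 1   # 1 byte per F8_E4M3 element
scale_bytes = scale_shape[0] * scale_shape[1] * 2  # BF16 scale companion

# Net saving per tensor: BF16 weight replaced by FP8 weight plus scales.
saved_per_tensor = bf16_bytes - fp8_bytes - scale_bytes

print(f"inferred weight shape: {rows} x {cols}")
print(f"BF16 size: {bf16_bytes / GIB:.3f} GiB")
print(f"FP8 size:  {fp8_bytes / GIB:.3f} GiB")
print(f"scales:    {scale_bytes / 1024:.1f} KiB per tensor")
print(f"net saving for both tensors: {2 * saved_per_tensor / GIB:.2f} GiB")
```

Under that assumption the script reproduces the 2.368 GiB and 1.184 GiB entries in the table, and the scale companions are small enough that the combined saving stays in line with the figure quoted above.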
Local validation
Validated locally on 2026-04-30 on a single RTX 5090 32 GB, TP=1, with the InferRouter vLLM 0.20 overlay image plus the local FP8 embedding patch.
Smoke profile (initial fit verification)
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.96
--max-model-len 4029
--max-num-seqs 4
--max-num-batched-tokens 10500
--max-cudagraph-capture-size 4
--block-size 16
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--enable-chunked-prefill
Observed startup:
Model loading took 26.19 GiB memory
GPU KV cache size: 16,640 tokens
Maximum concurrency for 4,029 tokens per request: 7.00x
post-startup GPU free memory: ~2.59 GiB
Operational smoke:
- deterministic short prompt: PASS (`2+2 je 4.`)
- 4 concurrent short Czech prompts: PASS, no OOM/crash
Production ceiling profile (builder pin)
This is the configuration the InferRouter builder pins as the recommended production envelope for the RTX 5090 32 GB fleet. It raises GPU memory utilization (0.96 to 0.98) and concurrency (`--max-num-seqs` 4 to 8) beyond the smoke profile while staying within the validated fit envelope.
--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.98
--max-model-len 4029
--max-num-seqs 8
--max-num-batched-tokens 12288
--max-cudagraph-capture-size 8
--enable-chunked-prefill
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
(--enforce-eager is OFF; CUDA graphs are captured up to the
--max-cudagraph-capture-size ceiling above.)
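For reproducibility, the flag list above can be kept as data and turned into a launch command. This is a sketch only: the `vllm serve` entry point is an assumption about how the builder starts the server, the helper name is hypothetical, and the command is printed rather than executed because these flags only work on an image with the required overlays applied.

```python
import shlex

MODEL = "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8"

# Production ceiling profile from this card; None marks a boolean switch.
PRODUCTION_FLAGS = {
    "--quantization": "fp8",
    "--language-model-only": None,
    "--kv-cache-dtype": "turboquant_k8v4",
    "--gpu-memory-utilization": 0.98,
    "--max-model-len": 4029,
    "--max-num-seqs": 8,
    "--max-num-batched-tokens": 12288,
    "--max-cudagraph-capture-size": 8,
    "--enable-chunked-prefill": None,
    "--reasoning-parser": "qwen3",
    "--tool-call-parser": "qwen3_coder",
}


def build_argv(model: str, flags: dict) -> list:
    """Flatten the flag mapping into an argv list (hypothetical helper)."""
    argv = ["vllm", "serve", model]
    for flag, value in flags.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


argv = build_argv(MODEL, PRODUCTION_FLAGS)
print(shlex.join(argv))
```

Keeping the profile in a mapping makes it easy to diff against the smoke profile, which differs only in memory utilization, sequence count, batched-token budget, capture size and the explicit block size.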
Quality has not been accepted yet. This checkpoint exists so the production
builder can create a reproducible image and run the real side-by-side eval
against Qwen/Qwen3.6-27B-FP8 and
inferRouter/Qwen3.6-27B-FP8-lmhead-fp8.
Required vLLM overlays
This checkpoint is not loadable by stock vLLM 0.20.
Apply exactly two PR overlays, in this order, on top of vllm/vllm-openai:v0.20.0:
- #39931: hybrid TurboQuant support for Qwen3.6/GDN hybrid models.
- #41365: stacked FP8 vocabulary PR. Its `pull/41365.diff` already contains the #41000 FP8 `ParallelLMHead` hunks plus the new FP8 `VocabParallelEmbedding` / `embed_tokens` hunks. Do not also list `#41000` separately: applying `#41000` first and then `#41365` will fail in `patch --forward --batch` because the `#41000` hunks are already present in the `#41365` diff (this is the correct fail-fast behaviour, not silent overwrite).
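A build script can guard against the double-listing mistake before any patch is attempted. The sketch below is illustrative: the helper name and the `SUBSUMED` table are hypothetical, and the full URLs assume GitHub's standard `pull/<number>.diff` endpoint for the `vllm-project/vllm` repository.

```python
OVERLAY_ORDER = [39931, 41365]  # order given above

# PRs whose hunks are already contained in a later, stacked PR.
SUBSUMED = {41000: 41365}


def diff_urls(prs: list) -> list:
    """Return diff URLs in order, rejecting a PR listed next to its superset."""
    for inner, outer in SUBSUMED.items():
        if inner in prs and outer in prs:
            raise ValueError(
                f"#{inner} is already included in #{outer}; list only #{outer}"
            )
    return [f"https://github.com/vllm-project/vllm/pull/{n}.diff" for n in prs]


for url in diff_urls(OVERLAY_ORDER):
    print(url)
```

Failing here is cheaper than failing inside `patch --forward --batch` halfway through an image build, although the patch-level failure remains the authoritative check.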
The checkpoint uses explicit config opt-ins:
{
"quantization_config": {
"lm_head": true,
"embed_tokens": true,
"embeddings": true
}
}
Both opt-ins (lm_head and embed_tokens) are required at the config layer
even though the engine-side patch is unified — they enable distinct dispatcher
branches inside Fp8Config.get_quant_method.
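A loader or CI step can verify the opt-ins before handing the checkpoint to the engine. This is a minimal sketch with a hypothetical helper name; it only inspects a parsed `config.json` dictionary and does not call any vLLM API.

```python
import json

# The two opt-ins this card describes as required.
REQUIRED_OPT_INS = ("lm_head", "embed_tokens")


def missing_opt_ins(config: dict) -> list:
    """List required opt-ins that are absent or not exactly true."""
    quant = config.get("quantization_config", {})
    return [key for key in REQUIRED_OPT_INS if quant.get(key) is not True]


config = json.loads("""
{
  "quantization_config": {
    "lm_head": true,
    "embed_tokens": true,
    "embeddings": true
  }
}
""")

print(missing_opt_ins(config))                                    # nothing missing
print(missing_opt_ins({"quantization_config": {"lm_head": True}}))
```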
Notes
The FP8 embedding path in #41365 is intentionally memory-first: gathered rows are dequantized on demand. It proves load/runtime correctness and VRAM savings; a future fused embedding kernel would be the performance path.
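The memory-first idea (gather first, dequantize only the gathered rows) can be shown with a toy example. This is not the code from #41365: it is a pure-Python sketch with a block size of 2, plain floats standing in for FP8 values, and the assumption that dequantization multiplies each element by the `weight_scale_inv` entry of its block.

```python
BLOCK = 2  # toy block edge; the real checkpoint uses much larger blocks


def dequant_rows(weight_q, scale_inv, token_ids, block=BLOCK):
    """Gather the requested rows and dequantize them block-wise on demand."""
    out = []
    for tok in token_ids:
        row_q = weight_q[tok]
        row_scales = scale_inv[tok // block]  # scales for this row's block row
        out.append(
            [q * row_scales[col // block] for col, q in enumerate(row_q)]
        )
    return out


# 4 x 4 "quantized" table with a 2 x 2 grid of per-block scales.
weight_q = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 1.0, 1.0],
    [2.0, 0.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
scale_inv = [
    [0.5, 2.0],   # covers rows 0-1
    [4.0, 0.25],  # covers rows 2-3
]

rows = dequant_rows(weight_q, scale_inv, [2, 0])
print(rows)
```

Only the rows named in `token_ids` are ever expanded to higher precision, so the full table stays in its compact form; the cost is extra per-lookup arithmetic, which is what a fused kernel would remove.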
License
Inherited from upstream Qwen / Apache-2.0. The weights are deterministic mathematical derivatives of the upstream FP8 checkpoint.