Instructions for using inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8 with libraries and local apps.

Libraries

How to use inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8")
model = AutoModelForImageTextToText.from_pretrained("inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8

SGLang

How to use inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8 with Docker Model Runner:
```
docker model run hf.co/inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8
```

Qwen3.6-27B-FP8 with FP8 `lm_head` and FP8 `embed_tokens`

🚀 RTX 5090 FP8 production candidate (within envelope). This checkpoint keeps the official Qwen3.6-27B-FP8 weights and additionally quantizes the two large vocabulary tables (lm_head and model.language_model.embed_tokens) to FP8. Combined with vLLM hybrid TurboQuant (vllm-project/vllm#39931), the RTX 5090 16K envelope is quality-validated against BF16; see Production verdict below.

This is still a build/PR-stacked artefact, not a stock-vLLM model yet. It needs two vLLM source overlays: #39931 (hybrid TurboQuant) and the stacked FP8 vocabulary PR #41365, which carries both the FP8 ParallelLMHead opt-in (#41000) and the FP8 VocabParallelEmbedding opt-in in a single diff.

Production verdict (2026-04-30)

This artefact is accepted as the RTX 5090 production candidate for Qwen3.6-27B reviewer workloads inside the measured envelope:

prompt budget: ≤ 10,000 tokens
completion / reasoning budget: ≤ 6,144 generated tokens
runtime context: 16k
KV cache: turboquant_k8v4
required vLLM overlays: vllm-project/vllm#39931 and stacked vllm-project/vllm#41365

Quality gate summary: hard_gate_34 passed with 34/34 FP8 rows, zero transport or parser errors, legal agreement 0.833, harmful precision/recall 1.0/0.75, and wrong-link false positives 0.0. Silver200 fit-subset quality matched the BF16 reference envelope: 0.8852 agreement vs. BF16 0.88, harmful precision/recall 0.75/0.75.

Coverage caveat: this is not a full BF16 long-context replacement. Silver200 had 16/200 prompt-cap skips above the 10k prompt budget. Requests above that budget should route to the RTX 6000 Pro BF16 long-context path.
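
The envelope and the routing rule above amount to a simple admission check. The following is a minimal Python sketch, assuming token counts are already known from the tokenizer. The function name and the route labels are hypothetical and not part of InferRouter, and sending every out-of-envelope request (not only over-long prompts) to the BF16 path is an assumption of this sketch.

```python
# Limits copied from the production envelope above.
MAX_PROMPT_TOKENS = 10_000
MAX_COMPLETION_TOKENS = 6_144
RUNTIME_CONTEXT = 16 * 1024  # "16k" read as 16,384 tokens (assumption)


def choose_route(prompt_tokens: int, max_completion_tokens: int) -> str:
    """Return a hypothetical route label for one request."""
    fits = (
        prompt_tokens <= MAX_PROMPT_TOKENS
        and max_completion_tokens <= MAX_COMPLETION_TOKENS
        and prompt_tokens + max_completion_tokens <= RUNTIME_CONTEXT
    )
    # Hypothetical labels: the card names the hardware paths, not route ids.
    return "rtx5090-fp8" if fits else "rtx6000pro-bf16-long-context"


print(choose_route(8_000, 4_096))   # inside the envelope
print(choose_route(12_500, 2_048))  # prompt above the 10k budget
```

The two budgets sum to 16,144 tokens, so a request that respects both also fits the 16k runtime context; the third condition is kept only as a guard in case the limits are changed independently.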

Canonical decision record: lexhub/docs/decisions/adr-0129-qwen36-27b-fp8-16k-rtx5090-envelope.md.

What changed

This checkpoint is derived from Qwen/Qwen3.6-27B-FP8 and builds on the existing lm_head FP8 artifact inferRouter/Qwen3.6-27B-FP8-lmhead-fp8.

Two BF16 vocabulary tensors are now block-FP8 E4M3 with BF16 per-block scale companions:

| Tensor | Upstream dtype/size | This checkpoint |
| --- | --- | --- |
| `lm_head.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |
| `model.language_model.embed_tokens.weight` | BF16, 2.368 GiB | F8_E4M3, 1.184 GiB |

Added companion tensors:

lm_head.weight_scale_inv                           BF16 [1940, 40]
model.language_model.embed_tokens.weight_scale_inv BF16 [1940, 40]

Expected net model-load VRAM reduction vs the original Qwen FP8 checkpoint is about 2.36 GiB total: ~1.18 GiB from lm_head and ~1.18 GiB from embed_tokens.
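
The sizes above can be cross-checked from the scale tensor shape. This is a rough arithmetic sketch, assuming 128x128 quantization blocks and weight dimensions that are exact multiples of the block size; neither is stated in this card.

```python
BLOCK = 128               # assumed block edge (not stated in this card)
SCALE_SHAPE = (1940, 40)  # weight_scale_inv shape listed above
GIB = 2**30

rows = SCALE_SHAPE[0] * BLOCK  # implied vocabulary rows
cols = SCALE_SHAPE[1] * BLOCK  # implied hidden size

bf16_gib = rows * cols * 2 / GIB  # 2 bytes per BF16 element
fp8_gib = rows * cols * 1 / GIB   # 1 byte per FP8 element
scale_gib = SCALE_SHAPE[0] * SCALE_SHAPE[1] * 2 / GIB  # BF16 scales

saved_per_tensor_gib = bf16_gib - fp8_gib - scale_gib

print(f"implied weight shape: [{rows}, {cols}]")
print(f"BF16: {bf16_gib:.3f} GiB, FP8: {fp8_gib:.3f} GiB")
print(f"net saving for both tensors: {2 * saved_per_tensor_gib:.2f} GiB")
```

Under these assumptions the script reproduces the 2.368 GiB and 1.184 GiB table entries, and the scale companions are small enough (about 0.15 MiB each) that they barely affect the net saving.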

Local validation

Validated locally on 2026-04-30 on a single RTX 5090 32 GB, TP=1, with the InferRouter vLLM 0.20 overlay image plus the local FP8 embedding patch.

Smoke profile (initial fit verification)

--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.96
--max-model-len 4029
--max-num-seqs 4
--max-num-batched-tokens 10500
--max-cudagraph-capture-size 4
--block-size 16
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--enable-chunked-prefill

Observed startup:

Model loading took 26.19 GiB memory
GPU KV cache size: 16,640 tokens
Maximum concurrency for 4,029 tokens per request: 7.00x
post-startup GPU free memory: ~2.59 GiB

Operational smoke:

deterministic short prompt: PASS (output: "2+2 je 4.", Czech for "2+2 is 4.")
4 concurrent short Czech prompts: PASS, no OOM/crash

Production ceiling profile (builder pin)

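This is the configuration the InferRouter builder pins as the recommended production envelope for the RTX 5090 32 GB fleet. Relative to the smoke profile it raises --gpu-memory-utilization from 0.96 to 0.98, --max-num-seqs from 4 to 8, --max-num-batched-tokens from 10500 to 12288, and --max-cudagraph-capture-size from 4 to 8, while staying within the validated fit envelope.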

--quantization fp8
--language-model-only
--kv-cache-dtype turboquant_k8v4
--gpu-memory-utilization 0.98
--max-model-len 4029
--max-num-seqs 8
--max-num-batched-tokens 12288
--max-cudagraph-capture-size 8
--enable-chunked-prefill
--reasoning-parser qwen3
--tool-call-parser qwen3_coder

(--enforce-eager is OFF; CUDA graphs are captured up to the --max-cudagraph-capture-size ceiling above.)
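
To keep the pinned flags in one reviewable place, the profile can be held as data and expanded into a command line. This is a minimal sketch, assuming the server is started with `vllm serve` as in the generic snippet earlier on this page and that the required overlays are already applied; the `PROFILE` dict and the helper name are illustrative, not InferRouter builder code.

```python
import shlex

MODEL = "inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8"

# Flags copied from the production ceiling profile above.
# A value of None marks a boolean switch that takes no argument.
PROFILE = {
    "--quantization": "fp8",
    "--language-model-only": None,
    "--kv-cache-dtype": "turboquant_k8v4",
    "--gpu-memory-utilization": 0.98,
    "--max-model-len": 4029,
    "--max-num-seqs": 8,
    "--max-num-batched-tokens": 12288,
    "--max-cudagraph-capture-size": 8,
    "--enable-chunked-prefill": None,
    "--reasoning-parser": "qwen3",
    "--tool-call-parser": "qwen3_coder",
}


def build_serve_command(model: str, profile: dict) -> list:
    """Expand a flag dict into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", model]
    for flag, value in profile.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


print(shlex.join(build_serve_command(MODEL, PROFILE)))
```

Because --enforce-eager is deliberately absent from the dict, the generated command leaves CUDA graph capture enabled, matching the note above.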

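The smoke and ceiling profiles above verify fit and stability only; they are not a quality acceptance on their own. This checkpoint exists so the production builder can create a reproducible image and run the real side-by-side eval against Qwen/Qwen3.6-27B-FP8 and inferRouter/Qwen3.6-27B-FP8-lmhead-fp8. The accepted quality envelope is recorded under Production verdict.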

Required vLLM overlays

This checkpoint is not loadable by stock vLLM 0.20.

Apply exactly two PR overlays, in this order, on top of vllm/vllm-openai:v0.20.0:

1. #39931: hybrid TurboQuant support for Qwen3.6/GDN hybrid models.
2. #41365: stacked FP8 vocabulary PR. Its pull/41365.diff already contains the #41000 FP8 ParallelLMHead hunks plus the new FP8 VocabParallelEmbedding / embed_tokens hunks. Do not also list #41000 separately: applying #41000 first and then #41365 fails in patch --forward --batch because the #41000 hunks are already present in the #41365 diff. This is the intended fail-fast behaviour, not a silent overwrite.
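
The ordering rule can be captured as a small plan builder that refuses the redundant #41000 entry before any patch is run. This is a hedged sketch: the helper name is hypothetical, the `-p1` strip level is an assumption, and the URL pattern assumes GitHub's usual `/pull/<n>.diff` endpoint. It only constructs the plan and does not download or apply anything.

```python
REPO = "vllm-project/vllm"
OVERLAYS = [39931, 41365]          # apply in this order
ALREADY_INCLUDED = {41000: 41365}  # 41000's hunks ship inside 41365's diff


def overlay_plan(prs: list) -> list:
    """Return (diff_url, patch_argv) pairs, rejecting redundant PRs."""
    for pr, carrier in ALREADY_INCLUDED.items():
        if pr in prs and carrier in prs:
            raise ValueError(
                f"#{pr} is already contained in #{carrier}; list only #{carrier}"
            )
    # "-p1" is an assumed strip level for diffs taken at the repository root.
    patch_argv = ["patch", "--forward", "--batch", "-p1"]
    return [
        (f"https://github.com/{REPO}/pull/{pr}.diff", list(patch_argv))
        for pr in prs
    ]


for url, argv in overlay_plan(OVERLAYS):
    print(url, "->", " ".join(argv))
```

Failing early in the plan step gives the same fail-fast outcome as the patch error described above, but with a message that names the cause.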

The checkpoint uses explicit config opt-ins:

{
  "quantization_config": {
    "lm_head": true,
    "embed_tokens": true,
    "embeddings": true
  }
}

Both opt-ins (lm_head and embed_tokens) are required at the config layer even though the engine-side patch is unified — they enable distinct dispatcher branches inside Fp8Config.get_quant_method.
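
A builder can verify the opt-ins before starting the engine. The following is a minimal sketch, assuming the flags live under `quantization_config` in the checkpoint's config JSON exactly as shown above; the helper name is hypothetical.

```python
import json

# The two opt-ins the text above calls out as required.
REQUIRED_OPTINS = ("lm_head", "embed_tokens")


def missing_optins(config: dict) -> list:
    """Return the required opt-in keys that are absent or not literally true."""
    quant = config.get("quantization_config", {})
    return [key for key in REQUIRED_OPTINS if quant.get(key) is not True]


# Inline copy of the fragment shown above, standing in for the real config file.
example = json.loads(
    '{"quantization_config": {"lm_head": true, "embed_tokens": true, "embeddings": true}}'
)
print(missing_optins(example))  # []
```

An empty list means both opt-ins are present; any returned key names an opt-in that would leave one of the two vocabulary tensors without its FP8 branch.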

Notes

The FP8 embedding path in #41365 is intentionally memory-first: gathered rows are dequantized on demand. It proves load/runtime correctness and VRAM savings; a future fused embedding kernel would be the performance path.
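
The gather-then-dequantize idea can be illustrated in plain Python. This is a toy sketch and not the code from #41365: plain floats stand in for the FP8 payload, the block size is 2 instead of a realistic value, and it assumes the dequantized value is the stored value multiplied by the matching `weight_scale_inv` entry.

```python
def dequantize_gathered_rows(q_weight, scale_inv, token_ids, block):
    """Dequantize only the embedding rows selected by token_ids."""
    out = []
    for tok in token_ids:
        q_row = q_weight[tok]                 # gather the quantized row first
        row_scales = scale_inv[tok // block]  # scales for this row's block-row
        # Each column picks the scale of the block it falls into.
        out.append([q * row_scales[c // block] for c, q in enumerate(q_row)])
    return out


# Toy 4x4 table split into 2x2 blocks, so the scale grid is 2x2.
q_weight = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 3.0, 2.0, 1.0],
]
scale_inv = [
    [0.5, 2.0],
    [1.0, 0.25],
]

print(dequantize_gathered_rows(q_weight, scale_inv, [0, 3], block=2))
# [[0.5, 1.0, 6.0, 8.0], [4.0, 3.0, 0.5, 0.25]]
```

Only the looked-up rows are ever expanded, so the full table stays in its compact form in memory; the per-lookup multiply is the cost a fused kernel would aim to reduce.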

License

Inherited from upstream Qwen / Apache-2.0. The weights are deterministic mathematical derivatives of the upstream FP8 checkpoint.


Safetensors

Model size: 28B params
Tensor type: BF16, F8_E4M3

Model tree for inferRouter/Qwen3.6-27B-FP8-lmhead-embed-fp8

Base model: Qwen/Qwen3.6-27B
Quantized: Qwen/Qwen3.6-27B-FP8