Qwen3.6-27B GPTQ 8-bit

GPTQ 8-bit quantization of Qwen/Qwen3.6-27B, a 27B-parameter dense multimodal model.

Includes the full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.

Model Overview

  • Architecture: Qwen3_5ForConditionalGeneration (multimodal: text + vision; dense sibling of qwen3_5_moe)
  • Total parameters: ~27B
  • Layers: 64 (48 linear-attention + 16 full-attention, repeating 3:1 pattern)
  • Hidden size: 5120, intermediate size: 17408 (dense MLP — no MoE)
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT, BF16 (333 tensors)
  • MTP module: 1-layer speculative decoding head, BF16 (15 tensors)

Quantization Details

The large Linear modules in the text decoder are quantized to INT8 using GPTQ. The vision encoder, MTP module, norms, embeddings, LM head, and the small linear_attn.in_proj_a / in_proj_b projections remain in BF16/FP16 to preserve quality.

| Component | Precision | Notes |
| --- | --- | --- |
| mlp.{gate_proj, up_proj, down_proj} | INT8 (GPTQ) | All 64 layers |
| self_attn.{q,k,v,o}_proj | INT8 (GPTQ) | 16 full-attention layers |
| linear_attn.{in_proj_qkv, in_proj_z, out_proj} | INT8 (GPTQ) | 48 linear-attention layers (GatedDeltaNet) |
| linear_attn.{in_proj_a, in_proj_b} | FP16 | Tiny projections, kept at full precision |
| Vision encoder (model.visual.*) | BF16 | 333 tensors, full precision |
| MTP module (mtp.*) | BF16 | 15 tensors, full precision |
| Embeddings, LM head, norms | FP16/BF16 | Full precision |
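
As a quick sanity check on the breakdown above, the number of quantized Linear modules can be tallied from the listed names and layer counts. This is only a sketch: it assumes each listed name corresponds to exactly one Linear module per layer, and the dictionary layout is illustrative rather than taken from the quantization script.

```python
# (module names, number of layers that contain them), copied from the breakdown above.
QUANTIZED = {
    "mlp": (["gate_proj", "up_proj", "down_proj"], 64),
    "self_attn": (["q_proj", "k_proj", "v_proj", "o_proj"], 16),
    "linear_attn": (["in_proj_qkv", "in_proj_z", "out_proj"], 48),
}
# Small GatedDeltaNet projections that stay in FP16.
UNQUANTIZED_LINEAR_ATTN = (["in_proj_a", "in_proj_b"], 48)


def count_modules(names, layers):
    """Number of Linear modules if every name appears once per layer."""
    return len(names) * layers


quantized_total = sum(count_modules(n, l) for n, l in QUANTIZED.values())
kept_fp16_total = count_modules(*UNQUANTIZED_LINEAR_ATTN)
print("INT8 Linear modules:", quantized_total)
print("FP16 linear_attn projections:", kept_fp16_total)
```

Under that one-module-per-name assumption, the tally comes to 400 INT8 modules and 96 small FP16 projections in the text decoder.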

GPTQ configuration:

  • Bits: 8
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
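
To make the "8 bits, group size 32, symmetric" settings concrete, here is a minimal pure-Python sketch of symmetric group-wise quantization. It uses plain round-to-nearest, not GPTQ's error-compensating weight updates, so it illustrates only the storage format (one scale per 32 weights, integers in [-127, 127]). The function names are hypothetical and do not come from GPTQModel.

```python
def quantize_group_sym(weights, bits=8):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Symmetric: zero maps to zero, so no zero-point offset is needed.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def quantize_row(row, group_size=32, bits=8):
    """Split a weight row into groups, each quantized with its own scale."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate weights from (ints, scale) pairs."""
    return [q * scale for qs, scale in groups for q in qs]


row = [0.01 * ((i % 7) - 3) for i in range(64)]  # toy weights
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error", max_err)
```

With a per-group scale, the worst-case rounding error in each group is half of that group's scale, which is why small groups tend to track the original weights more closely than a single per-row scale.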

Calibration

  • Dataset: Mixed — evol-codealpaca-v1 (code) + C4 (general text)
  • Samples: 512, binned uniformly across context lengths 256–2048
  • Quantizer: GPTQModel v5.7.1
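
The card does not spell out how the 512 samples were binned, so the following is a sketch of one plausible reading: split the 256 to 2048 token range into equal-width bins and draw the same number of samples from each. The bin count of 8 and both function names are assumptions made for illustration.

```python
def length_bins(min_len=256, max_len=2048, num_bins=8):
    """Return num_bins contiguous (low, high) token-length ranges."""
    width = (max_len - min_len) / num_bins
    edges = [round(min_len + i * width) for i in range(num_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


def samples_per_bin(total=512, num_bins=8):
    """Spread `total` samples as evenly as possible across the bins."""
    base, extra = divmod(total, num_bins)
    return [base + (1 if i < extra else 0) for i in range(num_bins)]


for (low, high), count in zip(length_bins(), samples_per_bin()):
    print(f"{low:>4} to {high:>4} tokens: {count} samples")
```

Balancing the bins this way keeps short prompts from dominating the calibration statistics, which is presumably the motivation for binning by length in the first place.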

Model Size

| Version | Size | Compression |
| --- | --- | --- |
| BF16 (original) | ~50 GB | - |
| GPTQ 8-bit | 32 GB | 1.6× |
| GPTQ 4-bit (FOEM) | 21 GB | 2.4× |
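
The compression ratios above are simply the BF16 size divided by the quantized size. The sketch below reproduces that arithmetic and, as a rough cross-check, estimates the BF16 footprint from the parameter count. It assumes roughly 27e9 parameters at 2 bytes each and that the listed sizes are closer to GiB than to decimal GB; neither assumption is stated in the card.

```python
GIB = 2 ** 30


def bf16_size_gib(num_params):
    """Approximate BF16 checkpoint size: 2 bytes per parameter."""
    return num_params * 2 / GIB


def compression(baseline_gb, quantized_gb):
    """Ratio of baseline size to quantized size, rounded to one decimal."""
    return round(baseline_gb / quantized_gb, 1)


print("Estimated BF16 size (GiB):", round(bf16_size_gib(27e9), 1))
print("8-bit ratio:", compression(50, 32))
print("4-bit ratio:", compression(50, 21))
```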

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
| --- | --- | --- |
| BF16 (original) | 7.0652 | - |
| GPTQ 8-bit (this model) | 7.0697 | +0.07% (effectively lossless) |
| GPTQ 4-bit (FOEM) | 7.2032 | +1.95% |
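
The evaluation script itself is not included in this card. As a sketch of what seq_len=2048 with stride=512 usually means, the helper below enumerates strided evaluation windows in the common sliding-window style: each window feeds up to 2048 tokens of context, but only the tokens not already scored by the previous window count toward the loss. The function names are hypothetical.

```python
import math


def sliding_windows(num_tokens, seq_len=2048, stride=512):
    """Yield (begin, end, num_scored) for strided perplexity evaluation.

    The model sees tokens[begin:end]; only the last num_scored tokens
    contribute to the loss, so no token is scored twice.
    """
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + seq_len, num_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == num_tokens:
            break


def perplexity(total_nll, total_scored):
    """Perplexity from summed negative log-likelihood (in nats)."""
    return math.exp(total_nll / total_scored)


windows = list(sliding_windows(5000))
print(windows[:2], "...", windows[-1])
print("tokens scored:", sum(n for _, _, n in windows))
```

A smaller stride gives every scored token more left context at the cost of more forward passes, so perplexities measured with different strides are not directly comparable.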

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.6-27B-GPTQ-8bit \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
| --- | --- |
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 262144 | Full 256K context window support |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |

vLLM bug workaround (may apply): Up through at least vLLM 0.19.x, Qwen3_5TextConfig defines ignore_keys_at_rope_validation as a list instead of a set, causing a TypeError during config parsing. Apply this patch before serving if you hit the error:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

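The one-liner above hard-codes a Python 3.12 dist-packages path and prints "Patched" even when nothing was replaced. A sketch of the same replacement as a small script, which locates the installed vllm package instead and reports whether each file actually changed, might look like this. The function names are hypothetical; the file names and the replaced text are taken from the command above.

```python
import importlib.util
from pathlib import Path

# Exact text from the workaround above: a list literal to be turned into a set.
OLD = (
    'ignore_keys_at_rope_validation"] = [\n'
    '            "mrope_section",\n'
    '            "mrope_interleaved",\n'
    '        ]'
)
NEW = (
    'ignore_keys_at_rope_validation"] = {\n'
    '            "mrope_section",\n'
    '            "mrope_interleaved",\n'
    '        }'
)


def patch_text(text):
    """Return (patched_text, changed) for one config module's source."""
    patched = text.replace(OLD, NEW)
    return patched, patched != text


def patch_installed_vllm():
    """Apply the replacement to the vllm install visible to this interpreter."""
    spec = importlib.util.find_spec("vllm")
    if spec is None or not spec.submodule_search_locations:
        print("vllm is not installed in this environment")
        return
    root = Path(next(iter(spec.submodule_search_locations)))
    for name in ("qwen3_5.py", "qwen3_5_moe.py"):
        path = root / "transformers_utils" / "configs" / name
        if not path.exists():
            print("Not found:", path)
            continue
        patched, changed = patch_text(path.read_text())
        if changed:
            path.write_text(patched)
        print("Patched" if changed else "No change needed:", path)
```

Call patch_installed_vllm() with the same Python interpreter that runs vllm serve. Running it twice is harmless, since the second pass finds nothing left to replace.
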
Vision Example (via the OpenAI-compatible API)

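import base64

import requests

# Encode the local image as base64 so it can be sent inline as a data URL.
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Send one image plus a text prompt to the chat completions endpoint served by vLLM.
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.6-27B-GPTQ-8bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
response.raise_for_status()  # Surface HTTP errors instead of failing on a missing key below.
print(response.json()["choices"][0]["message"]["content"])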
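
The request above assumes a single PNG file. For other formats or multiple images, a helper along the following lines could build the message payload. This is a sketch: to_data_url and build_messages are hypothetical names, and the default max_images=2 simply mirrors the --limit-mm-per-prompt value used in the serve command.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data URL, inferring the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_messages(data_urls, prompt, max_images=2):
    """Build a single-turn user message with images followed by a text prompt."""
    if len(data_urls) > max_images:
        raise ValueError(f"server is configured for at most {max_images} images")
    content = [{"type": "image_url", "image_url": {"url": u}} for u in data_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The returned list can be passed as the "messages" field of the request body in place of the hand-written one.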

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's text-only loader expects the model.layers.* weight prefix; this checkpoint uses the multimodal layout with model.language_model.layers.* so vision and MTP weights round-trip cleanly. Use vLLM for inference.
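
To see the layout described in the note, tensor names can be grouped by prefix. The sketch below assumes the checkpoint is sharded and ships the conventional model.safetensors.index.json file with a "weight_map" entry; the helper names are hypothetical, and the prefixes are the ones mentioned in this card.

```python
import json
from collections import Counter

# Prefixes named in this card; anything else is counted as "other".
PREFIXES = {
    "text decoder layers": "model.language_model.layers.",
    "vision encoder": "model.visual.",
    "MTP module": "mtp.",
}


def count_by_component(tensor_names):
    """Count tensor names under each known prefix."""
    counts = Counter()
    for name in tensor_names:
        for label, prefix in PREFIXES.items():
            if name.startswith(prefix):
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts


def load_tensor_names(index_path="model.safetensors.index.json"):
    """Read tensor names from a sharded-checkpoint index file."""
    with open(index_path) as f:
        return list(json.load(f)["weight_map"])
```

Running count_by_component(load_tensor_names()) in a local copy of the repository should show no tensors under a bare model.layers. prefix, which is the mismatch the text-only loader runs into.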

Technical Notes

Qwen3.6-27B is a dense multimodal model — it shares the Qwen3_5ForConditionalGeneration wrapper with the MoE-based Qwen3.6-35B-A3B but uses a standard dense MLP in every decoder layer instead of an expert mixture. The text decoder alternates 3 linear-attention (GatedDeltaNet) layers with 1 full-attention layer, repeated 16 times for 64 total layers.
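
The 3:1 pattern can be written out explicitly. This sketch assumes the full-attention layer is the last of each group of four, as the ordering in the description suggests; the string labels are illustrative and not read from the model config.

```python
def layer_types(num_layers=64, period=4):
    """Three linear-attention layers followed by one full-attention layer, repeated."""
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types()
print(types[:4])
print(types.count("linear_attention"), "linear,", types.count("full_attention"), "full")
```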

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text decoder's quantizable Linear modules are converted to INT8.

The model was quantized using a small custom GPTQModel definition, Qwen3_5GPTQ (a mirror of Qwen3_5MoeGPTQ with the MoE block replaced by a dense MLP), registered under model_type=qwen3_5.

Credits

License

This model inherits the Apache 2.0 license from the base model.
