Qwen3.6-27B GPTQ 8-bit

GPTQ 8-bit quantization of Qwen/Qwen3.6-27B, a 27B-parameter dense multimodal model.

The checkpoint includes the full vision encoder and the MTP (Multi-Token Prediction) module, which provide image understanding and speculative decoding support, respectively.

Model Overview

Architecture: Qwen3_5ForConditionalGeneration (multimodal: text + vision; dense sibling of qwen3_5_moe)
Total parameters: ~27B
Layers: 64 (48 linear-attention + 16 full-attention, repeating 3:1 pattern)
Hidden size: 5120, intermediate size: 17408 (dense MLP — no MoE)
Context length: 262,144 tokens
Vision encoder: 27-block ViT, BF16 (333 tensors)
MTP module: 1-layer speculative decoding head, BF16 (15 tensors)

Quantization Details

All quantizable Linear modules in the text decoder are quantized to INT8 using GPTQ. The vision encoder, MTP module, norms, embeddings, and LM head remain in BF16/FP16 to preserve quality.

| Component | Precision | Notes |
| --- | --- | --- |
| `mlp.{gate_proj, up_proj, down_proj}` | INT8 (GPTQ) | All 64 layers |
| `self_attn.{q,k,v,o}_proj` | INT8 (GPTQ) | 16 full-attention layers |
| `linear_attn.{in_proj_qkv, in_proj_z, out_proj}` | INT8 (GPTQ) | 48 linear-attention layers (GatedDeltaNet) |
| `linear_attn.{in_proj_a, in_proj_b}` | FP16 | Tiny projections, kept at full precision |
| Vision encoder (`model.visual.*`) | BF16 | 333 tensors, full precision |
| MTP module (`mtp.*`) | BF16 | 15 tensors, full precision |
| Embeddings, LM head, norms | FP16/BF16 | Full precision |
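The table above can be read as a set of name-matching rules. The sketch below is one way to encode it, for example to sanity-check tensor names from a checkpoint. The regular expressions and the example tensor names are illustrative assumptions, not taken from the quantization script.

```python
import re

# Rules transcribed from the precision table; the first matching rule wins.
# The patterns themselves are an assumption about how tensor names look.
PRECISION_RULES = [
    (r"^model\.visual\.", "BF16"),                                   # vision encoder
    (r"^mtp\.", "BF16"),                                             # MTP module
    (r"\.linear_attn\.in_proj_(a|b)\b", "FP16"),                     # tiny projections
    (r"\.linear_attn\.(in_proj_qkv|in_proj_z|out_proj)\b", "INT8 (GPTQ)"),
    (r"\.self_attn\.(q|k|v|o)_proj\b", "INT8 (GPTQ)"),
    (r"\.mlp\.(gate_proj|up_proj|down_proj)\b", "INT8 (GPTQ)"),
]


def expected_precision(tensor_name: str) -> str:
    """Return the precision the table implies for a given tensor name."""
    for pattern, precision in PRECISION_RULES:
        if re.search(pattern, tensor_name):
            return precision
    # Everything else (embeddings, LM head, norms) is listed as unquantized.
    return "FP16/BF16"


# Hypothetical tensor names, used only to exercise the rules.
for name in (
    "model.language_model.layers.0.mlp.gate_proj.qweight",
    "model.language_model.layers.0.linear_attn.in_proj_a.weight",
    "model.visual.blocks.0.mlp.fc1.weight",
):
    print(name, "->", expected_precision(name))
```

The vision and MTP rules come first so that any MLP or attention projections inside those modules are not mistaken for quantized decoder weights.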

GPTQ configuration:

Bits: 8
Group size: 32
Symmetric: Yes
desc_act: No
true_sequential: Yes
act_group_aware: Yes
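To make "8 bits, group size 32, symmetric" concrete, here is a minimal sketch of symmetric group-wise quantization in plain Python. It uses simple round-to-nearest, not GPTQ's error-compensating update, and the clamp to ±127 is an assumption for illustration. It only shows how each group of 32 weights shares one scale.

```python
def quantize_group_symmetric(values, bits=8):
    """Round-to-nearest symmetric quantization of one group of floats."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Symmetric: no zero point, integers are clamped to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantize_row(row, group_size=32, bits=8):
    """Split a weight row into groups and quantize each with its own scale."""
    return [
        quantize_group_symmetric(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate floats from (integers, scale) pairs."""
    return [q * scale for quantized, scale in groups for q in quantized]


# Deterministic toy weights in [-1, 1]; real weights come from the model.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.6f}")
```

With round-to-nearest, the per-weight error is bounded by half a scale step. A smaller group size gives each scale fewer weights to cover, at the cost of storing more scales.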

Calibration

Dataset: Mixed — evol-codealpaca-v1 (code) + C4 (general text)
Samples: 512, binned uniformly across context lengths 256–2048
Quantizer: GPTQModel v5.7.1
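The card does not say how the length bins were defined. As one plausible reading of "binned uniformly", the sketch below splits the 512 samples evenly over equal-width length bins between 256 and 2048 tokens. The bin count of 8 and the function name are assumptions for illustration only.

```python
def bin_quotas(total_samples=512, min_len=256, max_len=2048, num_bins=8):
    """Split total_samples evenly over equal-width context-length bins.

    num_bins is a hypothetical choice; the source only states the sample
    count and the overall length range.
    """
    width = (max_len - min_len) / num_bins
    base, extra = divmod(total_samples, num_bins)
    bins = []
    for i in range(num_bins):
        bins.append({
            "min_len": round(min_len + i * width),
            "max_len": round(min_len + (i + 1) * width),
            # Spread any remainder over the first bins.
            "samples": base + (1 if i < extra else 0),
        })
    return bins


for quota in bin_quotas():
    print(quota)
```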

Model Size

| Version | Size | Compression |
| --- | --- | --- |
| BF16 (original) | ~50 GB | - |
| GPTQ 8-bit | 32 GB | 1.6× |
| GPTQ 4-bit (FOEM) | 21 GB | 2.4× |
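The compression column is the BF16 size divided by the quantized size. A short check using the rounded sizes from the table:

```python
# Sizes in GB as listed in the table (the BF16 figure is approximate).
sizes_gb = {
    "BF16 (original)": 50,
    "GPTQ 8-bit": 32,
    "GPTQ 4-bit (FOEM)": 21,
}

baseline = sizes_gb["BF16 (original)"]
ratios = {name: baseline / size for name, size in sizes_gb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The 8-bit ratio stays below the ideal 2×, presumably because the vision encoder, MTP module, embeddings, and LM head remain in 16-bit and every group of 32 weights stores its own scale.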

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
| --- | --- | --- |
| BF16 (original) | 7.0652 | - |
| GPTQ 8-bit (this model) | 7.0697 | +0.07% (effectively lossless) |
| GPTQ 4-bit (FOEM) | 7.2032 | +1.95% |
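The card gives only `seq_len` and `stride`, not the evaluation script. The sketch below assumes the common sliding-window scheme, in which each window spans up to `seq_len` tokens but only tokens not already scored by an earlier window count toward the loss. It also shows how the degradation column follows from two perplexity values. All function names are hypothetical.

```python
import math


def strided_windows(num_tokens, seq_len=2048, stride=512):
    """Yield (begin, end, num_scored) for sliding-window perplexity."""
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + seq_len, num_tokens)
        # Tokens before prev_end serve as context only and are not re-scored.
        yield begin, end, end - prev_end
        prev_end = end
        if end == num_tokens:
            break


def perplexity(total_nll, num_scored_tokens):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(total_nll / num_scored_tokens)


def relative_change_percent(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


for window in strided_windows(5000):
    print(window)

# 4-bit row of the table, recomputed from the rounded perplexities.
print(f"{relative_change_percent(7.0652, 7.2032):.2f}%")
```

In a real evaluation, each window would be passed through the model and the negative log-likelihood of its scored tokens summed before calling `perplexity`.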

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.6-27B-GPTQ-8bit \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
| --- | --- |
| `--tensor-parallel-size 4` | Shard across 4 GPUs (adjust to your setup) |
| `--gpu-memory-utilization 0.95` | Use 95% of GPU VRAM for KV cache + weights |
| `--max-model-len 262144` | Full 256K context window support |
| `--dtype float16` | Run in FP16 (required for ROCm GPTQ kernels) |
| `--skip-mm-profiling` | Skip multimodal memory profiling at startup |
| `--limit-mm-per-prompt '{"image": 2}'` | Allow up to 2 images per request |
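Since the GPU count and context length depend on your hardware, it can help to generate the command from a few variables. The sketch below only assembles the flags documented above. It does not check whether a given tensor-parallel size or context length is valid for this model or fits in memory, and the helper name is hypothetical.

```python
import json
import shlex


def build_vllm_command(model, num_gpus, max_model_len=262144, max_images=2):
    """Assemble the `vllm serve` argument list from the flags in the table."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(num_gpus),
        "--gpu-memory-utilization", "0.95",
        "--max-model-len", str(max_model_len),
        "--dtype", "float16",
        "--skip-mm-profiling",
        # vLLM takes this limit as a JSON object, as in the command above.
        "--limit-mm-per-prompt", json.dumps({"image": max_images}),
    ]


# Example values only: 2 GPUs and a shorter context to reduce KV-cache memory.
command = build_vllm_command(
    "btbtyler09/Qwen3.6-27B-GPTQ-8bit", num_gpus=2, max_model_len=32768
)
print(shlex.join(command))
```

`shlex.join` quotes the JSON argument so the printed line can be pasted into a shell, and the list form can be passed directly to `subprocess.run`.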

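vLLM bug workaround (may apply): Up through at least vLLM 0.19.x, `Qwen3_5TextConfig` defines `ignore_keys_at_rope_validation` as a list instead of a set, which raises a TypeError during config parsing. If you hit this error, apply the patch below before serving. The paths assume vLLM is installed under /usr/local/lib/python3.12/dist-packages; adjust them if your installation differs.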

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

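A variant of the patch that does not hard-code the install path might look like the following sketch. It assumes the two files sit under `transformers_utils/configs/` inside the installed `vllm` package, as the paths above suggest, and that the text to replace is exactly the snippet the one-liner targets. The helper names are hypothetical.

```python
import importlib.util
from pathlib import Path

# Exact text targeted by the one-liner above, and its set-literal replacement.
OLD = (
    'ignore_keys_at_rope_validation"] = [\n'
    '            "mrope_section",\n'
    '            "mrope_interleaved",\n'
    '        ]'
)
NEW = (
    'ignore_keys_at_rope_validation"] = {\n'
    '            "mrope_section",\n'
    '            "mrope_interleaved",\n'
    '        }'
)


def patch_text(text):
    """Return (new_text, changed); text is untouched if the pattern is absent."""
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def package_dir(package_name):
    """Locate an installed top-level package's directory without importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def patch_vllm_configs():
    """Apply the list-to-set fix to both config files, if they are present."""
    root = package_dir("vllm")
    if root is None:
        print("vllm is not installed in this environment")
        return
    for name in ("qwen3_5.py", "qwen3_5_moe.py"):
        path = root / "transformers_utils" / "configs" / name
        if not path.exists():
            print("Not found:", path)
            continue
        new_text, changed = patch_text(path.read_text())
        if changed:
            path.write_text(new_text)
        print("Patched" if changed else "No change needed", path)
```

Calling `patch_vllm_configs()` once before serving is enough. Running it again is harmless because an already patched file no longer contains the list form.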
Vision Example (via OpenAI API)

import base64

import requests

# Encode a local image as base64 so it can be sent inline as a data URL.
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Send the image and a text prompt to the OpenAI-compatible vLLM endpoint.
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.6-27B-GPTQ-8bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
}, timeout=300)
response.raise_for_status()  # Surface HTTP errors before reading the body.
print(response.json()["choices"][0]["message"]["content"])
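The request above hard-codes a PNG file. For other image formats or several images per request, the payload can be built with small helpers such as the ones sketched below. The helper names are hypothetical, and the MIME type is guessed from the file extension, which is an assumption about how your files are named.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_request(model, image_paths, prompt, max_tokens=1024):
    """Build a chat-completions payload with images first, then the text prompt."""
    content = [
        {"type": "image_url", "image_url": {"url": image_to_data_url(p)}}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`. The number of images should stay within the `--limit-mm-per-prompt` value the server was started with.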

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's text-only loader expects the `model.layers.*` weight prefix, whereas this checkpoint uses the multimodal layout with `model.language_model.layers.*` so that vision and MTP weights round-trip cleanly. Use vLLM for inference.
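The difference between the two layouts is only a matter of tensor-name prefixes. The sketch below is a small check that tells them apart from a list of tensor names. The example names are hypothetical, and how you obtain the names (for instance from a safetensors index file) is left open.

```python
TEXT_ONLY_PREFIX = "model.layers."
MULTIMODAL_PREFIX = "model.language_model.layers."


def detect_layout(tensor_names):
    """Classify a checkpoint by the prefix of its decoder-layer tensors."""
    names = list(tensor_names)
    if any(name.startswith(MULTIMODAL_PREFIX) for name in names):
        return "multimodal"
    if any(name.startswith(TEXT_ONLY_PREFIX) for name in names):
        return "text-only"
    return "unknown"


# Hypothetical tensor names for illustration.
print(detect_layout(["model.language_model.layers.0.mlp.up_proj.qweight"]))
print(detect_layout(["model.layers.0.mlp.up_proj.qweight"]))
```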

Technical Notes

Qwen3.6-27B is a dense multimodal model — it shares the Qwen3_5ForConditionalGeneration wrapper with the MoE-based Qwen3.6-35B-A3B but uses a standard dense MLP in every decoder layer instead of an expert mixture. The text decoder alternates 3 linear-attention (GatedDeltaNet) layers with 1 full-attention layer, repeated 16 times for 64 total layers.
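The layer pattern can be written out in a few lines. This sketch assumes the full-attention layer is the last one in each group of four, which matches "3 linear-attention layers with 1 full-attention layer" but is not confirmed by the card. The type labels are illustrative.

```python
from collections import Counter


def layer_types(num_layers=64, period=4):
    """Return the attention type of each decoder layer in a 3:1 pattern."""
    # Assumption: every fourth layer (index 3, 7, 11, ...) uses full attention.
    return [
        "full_attention" if (index + 1) % period == 0 else "linear_attention"
        for index in range(num_layers)
    ]


types = layer_types()
print(types[:8])
print(Counter(types))
```

The counts come out to 48 linear-attention and 16 full-attention layers, matching the model overview.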

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text decoder's quantizable Linear modules are converted to INT8.

The model was quantized using a small custom GPTQModel definition (`Qwen3_5GPTQ`, a mirror of `Qwen3_5MoeGPTQ` with the MoE block replaced by a dense MLP), registered under `model_type=qwen3_5`.

Credits

Base Model: Qwen — Qwen3.6-27B
Quantization: GPTQ via GPTQModel v5.7.1
Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
