Qwen3.6-27B GPTQ 8-bit
GPTQ 8-bit quantization of Qwen/Qwen3.6-27B, a 27B-parameter dense multimodal model.
Includes the full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.
Model Overview
- Architecture: Qwen3_5ForConditionalGeneration (multimodal: text + vision; dense sibling of qwen3_5_moe)
- Total parameters: ~27B
- Layers: 64 (48 linear-attention + 16 full-attention, repeating 3:1 pattern)
- Hidden size: 5120, intermediate size: 17408 (dense MLP — no MoE)
- Context length: 262,144 tokens
- Vision encoder: 27-block ViT, BF16 (333 tensors)
- MTP module: 1-layer speculative decoding head, BF16 (15 tensors)
Quantization Details
All quantizable Linear modules in the text decoder are quantized to INT8 using GPTQ. The vision encoder, MTP module, norms, embeddings, and LM head remain at BF16/FP16 for quality preservation.
| Component | Precision | Notes |
|---|---|---|
| mlp.{gate_proj, up_proj, down_proj} | INT8 (GPTQ) | All 64 layers |
| self_attn.{q,k,v,o}_proj | INT8 (GPTQ) | 16 full-attention layers |
| linear_attn.{in_proj_qkv, in_proj_z, out_proj} | INT8 (GPTQ) | 48 linear-attention layers (GatedDeltaNet) |
| linear_attn.{in_proj_a, in_proj_b} | FP16 | Tiny projections, kept at full precision |
| Vision encoder (model.visual.*) | BF16 | 333 tensors, full precision |
| MTP module (mtp.*) | BF16 | 15 tensors, full precision |
| Embeddings, LM head, norms | FP16/BF16 | Full precision |
GPTQ configuration:
- Bits: 8
- Group size: 32
- Symmetric: Yes
- desc_act: No
- true_sequential: Yes
- act_group_aware: Yes
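To make the "8 bits, group size 32, symmetric" settings concrete, here is a minimal sketch of symmetric group-wise quantization. It is an assumption-laden illustration: it uses plain round-to-nearest, not GPTQ's error-compensating update, and the function names `quantize_symmetric` and `dequantize` are hypothetical helpers, not part of GPTQModel.

```python
def quantize_symmetric(weights, bits=8, group_size=32):
    """Round-to-nearest symmetric quantization with one scale per group.

    Illustration only: real GPTQ additionally adjusts the remaining
    weights to compensate for each rounding error.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: no zero point, the grid is centered on 0.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return q_values, scales


def dequantize(q_values, scales, group_size=32):
    """Map integer codes back to floats using the per-group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Hypothetical toy weights in [-1, 1]: 64 values, so two groups of 32.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
q, s = quantize_symmetric(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(s)} max_abs_error={max_err:.6f}")
```

With round-to-nearest, the reconstruction error of each weight is bounded by half of its group's scale, which is why small groups keep the error tight.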
Calibration
- Dataset: Mixed — evol-codealpaca-v1 (code) + C4 (general text)
- Samples: 512, binned uniformly across context lengths 256–2048
- Quantizer: GPTQModel v5.7.1
Model Size
| Version | Size | Compression |
|---|---|---|
| BF16 (original) | ~50 GB | — |
| GPTQ 8-bit | 32 GB | 1.6× |
| GPTQ 4-bit (FOEM) | 21 GB | 2.4× |
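The compression column is the BF16 size divided by the quantized size. A short sketch that recomputes it from the approximate sizes in the table (the dictionary name `sizes_gb` is just illustrative):

```python
# Approximate checkpoint sizes in GB, taken from the table above.
sizes_gb = {
    "BF16 (original)": 50,
    "GPTQ 8-bit": 32,
    "GPTQ 4-bit (FOEM)": 21,
}

baseline = sizes_gb["BF16 (original)"]
# Compression ratio = baseline size / quantized size, rounded to one decimal.
ratios = {name: round(baseline / size, 1) for name, size in sizes_gb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The 8-bit ratio stays below 2× because the vision encoder, MTP module, embeddings, and LM head remain at 16-bit precision, and each group of 32 weights carries its own quantization metadata.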
Perplexity
Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:
| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 7.0652 | — |
| GPTQ 8-bit (this model) | 7.0697 | +0.07% (effectively lossless) |
| GPTQ 4-bit (FOEM) | 7.2032 | +1.95% |
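The evaluation script itself is not shown here, so the following is only a sketch of how a strided evaluation with seq_len=2048 and stride=512 is commonly laid out: each window sees up to 2048 tokens of context, but only the tokens not scored by a previous window count toward the loss. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=512):
    """Yield (begin, end, n_scored) for a strided perplexity evaluation.

    n_scored is the number of trailing tokens in [begin, end) that have
    not been scored by an earlier window; the rest serve as context only.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


# Hypothetical corpus length, chosen only for the demonstration.
windows = list(sliding_windows(5000))
for begin, end, n_scored in windows:
    print(f"tokens [{begin}, {end}) -> score last {n_scored}")
```

Every token is scored exactly once, and after the first window each scored token has at least 1536 tokens of preceding context, which is why strided perplexity is lower than a non-overlapping evaluation of the same text.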
Usage
vLLM (Recommended for Serving)
vllm serve btbtyler09/Qwen3.6-27B-GPTQ-8bit \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--max-model-len 262144 \
--dtype float16 \
--skip-mm-profiling \
--limit-mm-per-prompt '{"image": 2}'
| Parameter | Description |
|---|---|
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 262144 | Full 256K context window support |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |
vLLM bug workaround (may apply): Up through at least vLLM 0.19.x, Qwen3_5TextConfig defines ignore_keys_at_rope_validation as a list instead of a set, causing a TypeError during config parsing. Apply this patch before serving if you hit the error:
python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n \"mrope_section\",\n \"mrope_interleaved\",\n ]',
        'ignore_keys_at_rope_validation\"] = {\n \"mrope_section\",\n \"mrope_interleaved\",\n }')
    open(f,'w').write(t)
    print('Patched', f)
"
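An exact string replacement silently does nothing if the indentation in your installed vLLM files differs from the pattern. As a hedged alternative, the sketch below performs the same list-to-set rewrite with a whitespace-tolerant regular expression and reports how many replacements were made. The function name `list_to_set_literal` and the sample text are hypothetical; the sketch operates on a string and assumes the assignment looks like the one described above.

```python
import re

# Matches: ignore_keys_at_rope_validation"] = [ "mrope_section", "mrope_interleaved", ]
# with any amount of whitespace between tokens.
_PATTERN = re.compile(
    r'(ignore_keys_at_rope_validation"\]\s*=\s*)'
    r'\[(\s*"mrope_section",\s*"mrope_interleaved",?\s*)\]'
)


def list_to_set_literal(source):
    """Return (patched_source, number_of_replacements)."""
    return _PATTERN.subn(r"\g<1>{\g<2>}", source)


# Hypothetical excerpt standing in for the contents of the config file.
sample = (
    'kwargs["ignore_keys_at_rope_validation"] = [\n'
    '        "mrope_section",\n'
    '        "mrope_interleaved",\n'
    '    ]\n'
)

patched, count = list_to_set_literal(sample)
print(f"replacements: {count}")
print(patched)
```

A count of 0 means the file either is already patched or has a different layout, in which case it is worth inspecting by hand before serving.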
Vision Example (via OpenAI API)
import base64
import requests

# Read the image and encode it as base64 for an inline data URL.
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Send the image and the prompt to the vLLM OpenAI-compatible endpoint.
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.6-27B-GPTQ-8bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
response.raise_for_status()  # Fail early on HTTP errors
print(response.json()["choices"][0]["message"]["content"])
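The example above hard-codes a PNG data URL. If you send other image formats, the MIME type in the URL should match the file. The sketch below builds the user message from a file name and raw bytes, guessing the MIME type with the standard library; `image_message` is a hypothetical helper, not part of vLLM or the OpenAI API.

```python
import base64
import mimetypes


def image_message(file_name, image_bytes, prompt):
    """Build an OpenAI-style user message with an inline base64 image."""
    mime, _ = mimetypes.guess_type(file_name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {file_name!r}")
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }


# The three bytes below are just the JPEG magic number, used as stand-in data.
msg = image_message("photo.jpg", b"\xff\xd8\xff", "Describe what you see in this image.")
print(msg["content"][0]["image_url"]["url"])
```

The returned dictionary can be placed directly in the "messages" list of the request body.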
GPTQModel / transformers
Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's text-only loader expects the model.layers.* weight prefix; this checkpoint uses the multimodal layout with model.language_model.layers.* so vision and MTP weights round-trip cleanly. Use vLLM for inference.
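To illustrate the naming difference only (this is not a loading recipe), the sketch below sorts tensor names into the groups mentioned in this card by prefix. The function `classify_tensor` and the example tensor names are hypothetical; only the prefixes come from the text above.

```python
def classify_tensor(name):
    """Classify a checkpoint tensor name by the prefixes used in this card."""
    if name.startswith("model.language_model.layers."):
        return "text_decoder (multimodal layout)"
    if name.startswith("model.layers."):
        return "text_decoder (text-only layout)"
    if name.startswith("model.visual."):
        return "vision_encoder"
    if name.startswith("mtp."):
        return "mtp"
    return "other"


# Hypothetical tensor names, chosen only to exercise each branch.
examples = [
    "model.language_model.layers.0.mlp.gate_proj.qweight",
    "model.layers.0.mlp.gate_proj.qweight",
    "model.visual.blocks.0.attn.qkv.weight",
    "mtp.layers.0.mlp.gate_proj.weight",
    "lm_head.weight",
]
for name in examples:
    print(f"{name} -> {classify_tensor(name)}")
```

A text-only loader that looks for the second form finds none of the decoder weights in this checkpoint, which is consistent with the load failure described above.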
Technical Notes
Qwen3.6-27B is a dense multimodal model — it shares the Qwen3_5ForConditionalGeneration wrapper with the MoE-based Qwen3.6-35B-A3B but uses a standard dense MLP in every decoder layer instead of an expert mixture. The text decoder alternates 3 linear-attention (GatedDeltaNet) layers with 1 full-attention layer, repeated 16 times for 64 total layers.
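The 3:1 layer pattern can be sketched as a simple schedule. This assumes the full-attention layer is the last of each group of four, which matches the ordering in the description but is not confirmed against the model config; `layer_types` is a hypothetical helper.

```python
from collections import Counter


def layer_types(num_layers=64, period=4):
    """Return the attention type per decoder layer.

    Assumption: within each block of `period` layers, the first three use
    linear attention (GatedDeltaNet) and the last uses full attention.
    """
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]


schedule = layer_types()
counts = Counter(schedule)
print(schedule[:8])
print(dict(counts))
```

Under that assumption the schedule yields 48 linear-attention and 16 full-attention layers, in line with the counts in the Model Overview.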
The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text decoder's quantizable Linear modules are converted to INT8.
The model was quantized using a small custom GPTQModel definition (Qwen3_5GPTQ, a mirror of Qwen3_5MoeGPTQ with the MoE block replaced by a dense MLP), registered under model_type=qwen3_5.
Credits
- Base Model: Qwen — Qwen3.6-27B
- Quantization: GPTQ via GPTQModel v5.7.1
- Quantized by: btbtyler09
License
This model inherits the Apache 2.0 license from the base model.