Qwen3.6-VL-REAP-26B-A3B-W4A16
First REAP+W4A16 release of Qwen/Qwen3.6-35B-A3B, retaining the full vision-language encoder, optimized for agentic coding and tool use on 24 GB consumer GPUs.
35B → 27B (REAP 25% prune) → W4A16 (INT4) | ~3B active per token
For the BF16 pruned checkpoint (pre-quantization), see atbender/Qwen3.6-VL-REAP-26B-A3B.
Why This Exists
Qwen3.6-35B-A3B is the best open MoE for agentic coding, but at 67 GB BF16 it doesn't fit a single consumer GPU. Plain quantization (AWQ/GPTQ) gets it down to ~18 GB but doesn't reduce the 256 experts' worth of routing overhead. REAP prunes the least-activated 25% of experts before quantizing, giving a smaller, faster model with minimal quality loss on the target workload (coding + tool use).
There are 100+ quantized versions of Qwen3.6-35B-A3B on HF. This is the only one that combines expert pruning + quantization + vision preservation.
Model Specifications
| Property | Original | REAP Pruned | This Model (W4A16) |
|---|---|---|---|
| Total Parameters | ~35B | ~27B | ~27B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 256 | 192 | 192 |
| Routed per Token | 8 | 8 | 8 |
| Shared Expert | 1/layer | 1/layer | 1/layer |
| Layers | 40 | 40 | 40 |
| Vision Encoder | Yes | Yes | Yes (BF16, unquantized) |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations) |
| Context | 262K | 262K | 262K |
| VRAM (32K ctx) | ~67 GB | ~50 GB | ~14 GB |
Calibration Dataset
Expert pruning quality depends heavily on what data the model sees during saliency measurement. We built a composite calibration set of 24,576 samples at seqlen 16,384 tilted toward Qwen3.6's headline capability: agentic coding and tool use.
| Source | Samples | Why |
|---|---|---|
| SWE-bench/SWE-smith-trajectories (tool split) | 6,144 | Agentic multi-turn. Full SWE-bench trajectories with tool calls, file edits, and test runs. This is the closest proxy to real-world agentic coding — the primary use case we're optimizing for. |
| Salesforce/xlam-function-calling-60k | 6,144 | Single-turn tool calling. Structured function definitions + invocations. Ensures the experts responsible for tool-use formatting survive pruning. |
| theblackcat102/evol-codealpaca-v1 | 4,096 | General coding. Evolved instruction-following across languages and difficulty levels. Breadth coverage so we don't over-specialize on agentic patterns. |
| open-r1/Mixture-of-Thoughts (code) | 2,730 | Code reasoning. Long chain-of-thought traces for programming problems. Preserves the model's ability to reason step-by-step through code. |
| open-r1/Mixture-of-Thoughts (math) | 2,730 | Math reasoning. Ensures pruning doesn't disproportionately kill the experts activated during mathematical reasoning — a known risk with code-only calibration. |
| open-r1/Mixture-of-Thoughts (science) | 2,732 | Science reasoning. Same rationale as math — broader domain coverage keeps the model general-purpose even though the primary target is coding. |
Design rationale: 50% of samples (12,288) are agentic or tool-calling data. The other 50% are general coding (4,096) and reasoning across code, math, and science (8,192). This split is intentional: at 25% expert pruning, we're only removing the tail of the saliency distribution. The diverse calibration ensures we're measuring true "least-useful" experts, not just "least-useful for code." Prior work at 50% pruning with pile-10k calibration showed degradation on tool-use tasks specifically because the calibration didn't exercise tool-calling experts enough.
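The split can be checked directly from the table. The sketch below copies the sample counts from the table above and prints each source's share; the shortened source labels and the helper name `mix_shares` are illustrative, not part of the release scripts.

```python
# Sample counts copied from the calibration table above.
CALIBRATION_MIX = {
    "SWE-smith-trajectories (tool split)": 6144,
    "xlam-function-calling-60k": 6144,
    "evol-codealpaca-v1": 4096,
    "Mixture-of-Thoughts (code)": 2730,
    "Mixture-of-Thoughts (math)": 2730,
    "Mixture-of-Thoughts (science)": 2732,
}


def mix_shares(mix: dict[str, int]) -> dict[str, float]:
    """Return each source's fraction of the total sample count."""
    total = sum(mix.values())
    return {name: count / total for name, count in mix.items()}


total_samples = sum(CALIBRATION_MIX.values())
for name, share in mix_shares(CALIBRATION_MIX).items():
    print(f"{name:40s} {share:6.1%}")
print("total samples:", total_samples)
```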
Quantization Details
| Property | Value |
|---|---|
| Method | AutoRound (Intel) |
| Bits | 4 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes (required for Marlin/vLLM fast kernels) |
| Calibration | github-code-clean (AutoRound built-in, code-focused), 512 samples, seqlen 2048 |
| Routers (mlp.gate) | FP16 — never quantized |
| shared_expert_gate | FP16 — never quantized |
| Vision encoder | BF16 — never quantized |
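To make "group size 128" and "symmetric" concrete, here is a minimal pure-Python sketch of group-wise symmetric INT4 quantization. It uses plain round-to-nearest, which is not what AutoRound does (AutoRound tunes the rounding on calibration data), and the scale convention `max|w| / 7` is an assumption chosen for illustration. All function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest: one shared scale, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4
    qmin = -(2 ** (bits - 1))           # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Rebuild approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy row of 256 weights: two groups of 128, each with its own scale.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error:", max_error)
```

Because each group carries its own scale, one outlier weight only coarsens the 128 values that share its group rather than the whole row.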
Why github-code-clean for quantization calibration? We align the AutoRound calibration with the REAP calibration's emphasis (code-heavy). AutoRound ships a curated code-focused calibration set (github-code-clean) that's natively tokenized for the framework. The 512 samples at seqlen 2048 give ~1M calibration tokens, with 8x as many samples as our prior pile-10k@64x512 approach.
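The token budget is simple arithmetic. The sketch below assumes every sample is filled to the full sequence length and, since the notation is not defined here, assumes that "64x512" means 64 samples of 512 tokens each.

```python
def calibration_tokens(num_samples: int, seq_len: int) -> int:
    """Upper bound on calibration tokens: every sample filled to seq_len."""
    return num_samples * seq_len


current = calibration_tokens(512, 2048)
# Assumption: "64x512" means 64 samples of 512 tokens each.
prior = calibration_tokens(64, 512)

print("current tokens:", current)
print("prior tokens:  ", prior)
print("sample ratio:  ", 512 // 64)
print("token ratio:   ", current // prior)
```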
What's Different from atbender's Qwen3.5 Work
- Full pipeline from scratch — REAP observe → prune → AutoRound quantize → postprocess. Not downstream of any pre-pruned checkpoint.
- 25% prune (not 50%) — 256 → 192 experts. Much safer quality retention at this model scale.
- Agentic-coding calibration — composite dataset tilted toward tool use (see above).
- Stronger quantization calibration — 512 samples at seqlen 2048 on github-code-clean (vs pile-10k @ 64x512).
- Vision encoder preserved — copied from base model, unquantized BF16.
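The prune step in the list above can be pictured as ranking each layer's experts by a saliency score and dropping the lowest quarter. The sketch below assumes the scores are already computed (REAP's actual saliency measurement is not reproduced here). `select_experts_to_keep` and the toy scores are hypothetical names and made-up data.

```python
import math


def select_experts_to_keep(saliency, prune_ratio=0.25):
    """Return sorted indices of the experts that survive pruning in one layer."""
    num_pruned = math.floor(len(saliency) * prune_ratio)
    # Rank expert indices from least to most salient; ties are broken by index.
    ranked = sorted(range(len(saliency)), key=lambda i: (saliency[i], i))
    pruned = set(ranked[:num_pruned])
    return [i for i in range(len(saliency)) if i not in pruned]


# Toy scores for a 256-expert layer (made-up values, not real measurements).
toy_saliency = [((i * 73) % 256) / 256 for i in range(256)]
kept = select_experts_to_keep(toy_saliency)
print(len(kept), "experts kept out of", len(toy_saliency))
```

In a real pipeline the router outputs for the pruned experts would presumably also be removed, so that top-8 routing only considers surviving experts.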
Important: dtype Must Be bfloat16
The GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504). Always use dtype=torch.bfloat16. float16 produces NaN outputs.
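The range limit can be reproduced with the standard library alone. In this sketch `struct` raises an error on float16 overflow, whereas on a GPU the value would instead become inf and later NaN. The bfloat16 emulation truncates a float32 to its top 16 bits, a simplification, since real hardware rounds to nearest. Both helper names are illustrative.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite float16 value, as quoted above


def fits_float16(value: float) -> bool:
    """True if value can be stored as a finite IEEE half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bfloat16_truncated(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    Real hardware rounds to nearest; truncation is enough to show the range.
    """
    top_two_bytes = struct.pack(">f", value)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


print(fits_float16(FLOAT16_MAX))       # within float16 range
print(fits_float16(70000.0))           # beyond float16 range
print(to_bfloat16_truncated(70000.0))  # finite, but coarsely rounded
```

bfloat16 keeps float32's 8-bit exponent, so it gives up precision rather than range. That is why the same activation survives in bfloat16 but not in float16.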
Prerequisites
pip install "git+https://github.com/huggingface/transformers.git@main"
pip install "torch>=2.7" --index-url https://download.pytorch.org/whl/cu128 # or cu126/cu121
pip install "auto-round>=0.12" accelerate torchvision
# Optional but recommended (10x faster GDN linear attention):
pip install flash-linear-attention causal-conv1d einops
The auto-round package is required — it installs the quantized-linear kernels that the loader needs to dequantize INT4 weights on the fly.
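Before loading the model, it can help to confirm what is actually installed. The sketch below only reports versions for the distribution names used in the pip commands above; it does not enforce the minimum versions. The helper name `installed_versions` is illustrative.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = ["transformers", "torch", "auto-round", "accelerate", "torchvision"]
OPTIONAL = ["flash-linear-attention", "causal-conv1d", "einops"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED + OPTIONAL).items():
    print(f"{name:24s} {version or 'NOT INSTALLED'}")
```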
Usage
Text only (load as CausalLM)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,  # MUST be bfloat16 (GDN overflows fp16)
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)
messages = [{"role": "user", "content": "Write a Python function to sort a list of dicts by key."}]
# return_dict=True yields input_ids and attention_mask regardless of the library's default return type
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Vision-language (load as ConditionalGeneration)
The model architecture is Qwen3_5MoeForConditionalGeneration (language model + vision encoder). The vision tower is kept at BF16; only the language model was quantized. Load with AutoModelForImageTextToText:
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch
processor = AutoProcessor.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
vLLM (recommended for serving)
Tested with vLLM ≥ 0.19.1 (vllm/vllm-openai:latest, image digest sha256:2622f38a… pulled 2026-04-19). The image registers Qwen3_5MoeForConditionalGeneration natively and consumes the auto_round quantization config — no custom patches, weight rewrites, or runtime hacks required.
Serve
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --quantization auto_round \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
Startup: weights load in 5 s, engine init (profile + KV cache + warmup) takes ~2 min. Peak VRAM on load: **15 GiB** at 32K context.
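Since engine init takes a couple of minutes, a client script may want to wait for the server before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route on the port mapped above. The helper names, default timeout, and polling interval are illustrative choices, not values from this release.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url: str = "http://localhost:8000") -> bool:
    """Return True once GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=server_is_up, timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` once after `docker run` and before the first request avoids connection errors during warmup. The `probe` argument makes the loop easy to test without a running server.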
Flag reference
| Flag | Why it's needed |
|---|---|
| --quantization auto_round | INT4 dequant path. Reads per-layer extra_config from the checkpoint. |
| --dtype bfloat16 | Required. GDN linear attention overflows fp16 → NaN. |
| --trust-remote-code | qwen3_5_moe uses custom modeling code. |
| --reasoning-parser qwen3 | Routes <think>…</think> into choices[0].message.reasoning instead of .content. |
| --enable-auto-tool-choice + --tool-call-parser qwen3_coder | Extracts structured tool_calls from Qwen's native tool-call format. |
| --max-model-len 32768 | Safe default for a 24 GB card. Model supports up to 262,144 on ≥ 80 GB hardware. |
Client usage (OpenAI-compatible API)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Text + chain-of-thought
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=1024,
)
print(r.choices[0].message.reasoning)  # the <think> block
print(r.choices[0].message.content)    # the final answer
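Conceptually, the reasoning parser separates the think block from the final answer. The sketch below illustrates that split on a raw string, assuming the text contains both the opening and closing tags. It is not vLLM's implementation, and `split_reasoning` is a hypothetical helper.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_text: str):
    """Split raw model output into (reasoning, content)."""
    match = THINK_BLOCK.search(raw_text)
    if match is None:
        return None, raw_text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    content = (raw_text[:match.start()] + raw_text[match.end():]).strip()
    return reasoning, content


reasoning, content = split_reasoning(
    "<think>17 * 23 = 17 * 20 + 17 * 3 = 391</think>\n391"
)
print(reasoning)
print(content)
```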
Vision (base64 data URL or remote HTTPS URL):
import base64
# Reuses the client created in the previous snippet
with open("my_image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What is in this picture?"},
        ],
    }],
    max_tokens=512,
)
print(r.choices[0].message.content)
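The snippet above hardcodes `image/jpeg`. For mixed image types, a small helper can derive the MIME type from the file extension. This is a sketch; `to_data_url` and its fallback default are illustrative, not part of any client library.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str, default_mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL."""
    # Guess the MIME type from the extension; fall back if it is unknown.
    mime, _ = mimetypes.guess_type(path)
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime or default_mime};base64,{payload}"
```

The returned string can be passed as the `url` value of an `image_url` content part.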
Tool calling (structured extraction via qwen3_coder parser):
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
)
tc = r.choices[0].message.tool_calls[0]
print(tc.function.name, tc.function.arguments)
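`tc.function.arguments` arrives as a JSON string, so the client still has to parse it, run the tool, and send the result back as a `tool` message. A minimal sketch of that step follows. The local `get_weather` stub, the registry, and the stand-in tool-call object are all hypothetical and only mimic the shape of the API response.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"(stub) weather for {location} is unavailable in this sketch"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the follow-up 'tool' message."""
    func = TOOL_REGISTRY[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)  # arguments are a JSON string
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": func(**kwargs),
    }


# Stand-in for r.choices[0].message.tool_calls[0].
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Tokyo"}'),
)
print(run_tool_call(fake_call))
```

Appending the assistant message and this `tool` message to `messages`, then calling `client.chat.completions.create` again, lets the model produce its final answer from the tool result.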
Verified end-to-end via the vLLM OpenAI API
| Test | Result |
|---|---|
| Math reasoning | 17 × 23 → 391 with CoT in message.reasoning |
| Code generation | Emits def is_palindrome(s: str) -> bool: return s == s[::-1] |
| Vision (simple) | Correctly identifies "red circle, black outline, white background" |
| Vision (compound) | Correctly counts 3 colored squares + names red/green/blue + left/middle/right |
| Tool calling | Structured tool_calls[0] = get_weather({"location": "Tokyo"}) |
Other inference stacks
- SGLang: Track upstream. SGLang uses its own architecture registry, so support for this model may lag.
- llama.cpp / GGUF: Not directly loadable (GGUF is a separate quantization format). You can re-quantize the BF16 pruned sibling atbender/Qwen3.6-VL-REAP-26B-A3B to GGUF with your own pipeline.
- MLX (Apple Silicon): The BF16 pruned sibling is a better starting point — MLX ships its own quantizer.
Hardware
- Built on: Single NVIDIA RTX Pro 6000 (Blackwell, 96 GB VRAM). End-to-end ~15h (13.4h pruning + 1.2h quantization + postprocessing).
- Runs on: Any 24 GB+ GPU (RTX 3090/4090/5090) at 32K context — peak ~15 GiB VRAM on load, leaving room for KV cache. 48 GB+ (A6000, Pro 6000, H100) for full 262K context.
Limitations
- Vision encoder is preserved structurally but was not re-calibrated post-prune. VL quality may differ from the base model.
- Calibration was English + code-heavy. Long-tail languages may degrade more than coding tasks.
- 25% expert pruning removes the least-used experts; rare-domain performance may be affected.
- No post-prune fine-tuning — this is a pure compression artifact.
Recipe
Full reproduction pipeline: reap_prune.py, quantize_autoround.py, postprocess.py. See reap_metadata.json for exact calibration stats and per-layer expert selections.
Credits
- Cerebras Research — REAP method
- Intel — AutoRound quantization
- Qwen Team (Alibaba) — Base model
- OpenMOSE & 0xSero — Prior art on the REAP+quantization recipe
Citation
@article{lasby2025reap,
title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
author={Lasby, Mike and others},
year={2025},
url={https://github.com/CerebrasResearch/reap}
}
@misc{autoround2024,
title={AutoRound: Advanced Weight Quantization},
author={Intel Corporation},
year={2024},
howpublished={\url{https://github.com/intel/auto-round}}
}