Qwen3.6-VL-REAP-26B-A3B-W4A16

First REAP+W4A16 release of Qwen/Qwen3.6-35B-A3B, retaining the full vision-language encoder, optimized for agentic coding and tool use on 24 GB consumer GPUs.

35B → 27B (REAP 25% prune) → W4A16 (INT4) | ~3B active per token

For the BF16 pruned checkpoint (pre-quantization), see atbender/Qwen3.6-VL-REAP-26B-A3B.

Why This Exists

Qwen3.6-35B-A3B is the best open MoE for agentic coding, but at 67 GB in BF16 it doesn't fit on a single consumer GPU. Plain quantization (AWQ/GPTQ) gets it down to ~18 GB but leaves all 256 experts per layer, and their routing overhead, in place. REAP prunes the least-activated 25% of experts before quantizing, giving a smaller, faster model with minimal quality loss on the target workload (coding + tool use).

There are 100+ quantized versions of Qwen3.6-35B-A3B on HF. This is the only one that combines expert pruning + quantization + vision preservation.

Model Specifications

| Property | Original | REAP Pruned | This Model (W4A16) |
| --- | --- | --- | --- |
| Total Parameters | ~35B | ~27B | ~27B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 256 | 192 | 192 |
| Routed per Token | 8 | 8 | 8 |
| Shared Expert | 1/layer | 1/layer | 1/layer |
| Layers | 40 | 40 | 40 |
| Vision Encoder | Yes | Yes | Yes (BF16, unquantized) |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations) |
| Context | 262K | 262K | 262K |
| VRAM (32K ctx) | ~67 GB | ~50 GB | ~14 GB |

Calibration Dataset

Expert pruning quality depends heavily on what data the model sees during saliency measurement. We built a composite calibration set of 24,576 samples at seqlen 16,384 tilted toward Qwen3.6's headline capability: agentic coding and tool use.

| Source | Samples | Why |
| --- | --- | --- |
| SWE-bench/SWE-smith-trajectories (tool split) | 6,144 | Agentic multi-turn. Full SWE-bench trajectories with tool calls, file edits, and test runs. This is the closest proxy to real-world agentic coding, the primary use case we're optimizing for. |
| Salesforce/xlam-function-calling-60k | 6,144 | Single-turn tool calling. Structured function definitions + invocations. Ensures the experts responsible for tool-use formatting survive pruning. |
| theblackcat102/evol-codealpaca-v1 | 4,096 | General coding. Evolved instruction-following across languages and difficulty levels. Breadth coverage so we don't over-specialize on agentic patterns. |
| open-r1/Mixture-of-Thoughts (code) | 2,730 | Code reasoning. Long chain-of-thought traces for programming problems. Preserves the model's ability to reason step-by-step through code. |
| open-r1/Mixture-of-Thoughts (math) | 2,730 | Math reasoning. Ensures pruning doesn't disproportionately kill the experts activated during mathematical reasoning, a known risk with code-only calibration. |
| open-r1/Mixture-of-Thoughts (science) | 2,732 | Science reasoning. Same rationale as math: broader domain coverage keeps the model general-purpose even though the primary target is coding. |

Design rationale: 50% of samples (12,288) are agentic coding and tool-calling data. The other 50% are general coding (4,096) and code, math, and science reasoning (8,192). This split is intentional: at 25% expert pruning, we're only removing the tail of the saliency distribution. The diverse calibration ensures we're measuring true "least-useful" experts, not just "least-useful for code." Prior work at 50% pruning with pile-10k calibration showed degradation on tool-use tasks specifically because the calibration didn't exercise tool-calling experts enough.
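
The mix can be checked with a few lines of arithmetic. The sketch below only restates the sample counts from the table above; the grouping into an "agentic/tool-use" half and the variable names are illustrative assumptions, not part of the released recipe.

```python
# Sample counts copied from the calibration table above.
calibration_mix = {
    "SWE-bench/SWE-smith-trajectories (tool split)": 6144,
    "Salesforce/xlam-function-calling-60k": 6144,
    "theblackcat102/evol-codealpaca-v1": 4096,
    "open-r1/Mixture-of-Thoughts (code)": 2730,
    "open-r1/Mixture-of-Thoughts (math)": 2730,
    "open-r1/Mixture-of-Thoughts (science)": 2732,
}

# Illustrative grouping (an assumption made for this sketch).
agentic_sources = {
    "SWE-bench/SWE-smith-trajectories (tool split)",
    "Salesforce/xlam-function-calling-60k",
}

total = sum(calibration_mix.values())
agentic = sum(n for name, n in calibration_mix.items() if name in agentic_sources)

print(f"total samples: {total}")
print(f"agentic/tool-use share: {agentic / total:.0%}")
for name, n in calibration_mix.items():
    print(f"{n / total:6.1%}  {name}")
```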

Quantization Details

| Property | Value |
| --- | --- |
| Method | AutoRound (Intel) |
| Bits | 4 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes (required for Marlin/vLLM fast kernels) |
| Calibration | `github-code-clean` (AutoRound built-in, code-focused), 512 samples, seqlen 2048 |
| Routers (mlp.gate) | FP16, never quantized |
| shared_expert_gate | FP16, never quantized |
| Vision encoder | BF16, never quantized |
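
To make the "Bits", "Group size", and "Symmetric" rows concrete, here is a minimal sketch of symmetric group-wise INT4 quantization using plain round-to-nearest. It is not AutoRound itself, which additionally tunes the rounding using calibration data; the function names and the choice of scale (largest magnitude in the group mapped to 7) are assumptions made for illustration.

```python
import math

def quantize_group_symmetric(weights, group_size=128, bits=4):
    """Quantize a flat list of floats to signed integers with one scale per group."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: no zero point, the scale alone maps integers back to floats.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groups(codes, scales, group_size=128):
    """Reconstruct approximate floats from integer codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]

# Synthetic weights: 256 values, i.e. two groups of 128.
weights = [0.1 * math.sin(i) for i in range(256)]
codes, scales = quantize_group_symmetric(weights)
restored = dequantize_groups(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(scales)}, max abs error: {max_error:.5f}")
```

With round-to-nearest and no clipping, the per-weight error is bounded by half the group's scale, which is why smaller groups (more scales) trade storage for accuracy.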

Why github-code-clean for quantization calibration? We align the AutoRound calibration with the code-heavy emphasis of the REAP calibration. AutoRound ships a curated code-focused calibration set (github-code-clean) that's natively tokenized for the framework. 512 samples at seqlen 2048 give ~1M calibration tokens (512 × 2048 = 1,048,576), with 8x as many samples as our prior pile-10k @ 64x512 approach.

What's Different from atbender's Qwen3.5 Work

Full pipeline from scratch — REAP observe → prune → AutoRound quantize → postprocess. Not downstream of any pre-pruned checkpoint.
25% prune (not 50%) — 256 → 192 experts. Much safer quality retention at this model scale.
Agentic-coding calibration — composite dataset tilted toward tool use (see above).
Stronger quantization calibration — 512 samples at seqlen 2048 on github-code-clean (vs pile-10k @ 64x512).
Vision encoder preserved — copied from base model, unquantized BF16.
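
As a rough illustration of the observe and prune steps, the toy sketch below scores each expert by its router-weighted output magnitude, averaged over the tokens routed to it, and keeps the top 75%. This is a simplified reading of the REAP criterion, applied to one layer with synthetic numbers; the function names and records are hypothetical, and the actual pipeline is the one in reap_prune.py.

```python
from collections import defaultdict

def expert_saliency(records):
    """Average gate_weight * output_norm per expert.

    records: iterable of (expert_id, gate_weight, output_norm), one entry per
    (token, routed expert) pair observed during calibration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output_norm in records:
        totals[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}

def select_experts(saliency, num_experts, prune_ratio=0.25):
    """Return sorted ids of the experts to keep; never-routed experts score 0."""
    num_keep = num_experts - int(num_experts * prune_ratio)
    ranked = sorted(range(num_experts),
                    key=lambda e: saliency.get(e, 0.0), reverse=True)
    return sorted(ranked[:num_keep])

# Synthetic observations for a layer with 8 experts (expert 7 is never routed).
records = [
    (0, 0.30, 2.0), (1, 0.25, 1.5), (2, 0.05, 0.2), (3, 0.20, 1.0),
    (4, 0.15, 0.9), (5, 0.02, 0.1), (0, 0.35, 2.2), (6, 0.10, 0.8),
]
kept = select_experts(expert_saliency(records), num_experts=8)
print(kept)  # experts 5 and 7 have the lowest scores and are dropped
```

With `num_experts=256` and the default `prune_ratio=0.25`, the same selection keeps 192 experts, matching the per-layer count in the specification table.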

Important: dtype Must Be bfloat16

The GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504). Always use dtype=torch.bfloat16. float16 produces NaN outputs.
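
The 65504 limit can be checked with the standard library alone. The sketch below is only an illustration of the two number formats, not of the GDN layers themselves: it packs values as IEEE half precision (which raises on overflow) and emulates bfloat16 by truncating a float32 to its top 16 bits. Real hardware typically rounds rather than truncates, so the bfloat16 values here are approximate.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite float16 value

def fits_float16(x: float) -> bool:
    """True if x can be packed as IEEE 754 half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (FLOAT16_MAX, 70000.0, 1.0e6):
    print(f"{value:>10}: fits float16 = {fits_float16(value)}, "
          f"bfloat16 ~ {to_bfloat16(value)}")
```

bfloat16 shares float32's 8-bit exponent, so it trades precision (7 mantissa bits) for range, which is what keeps large intermediate activations finite.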

Prerequisites

pip install "git+https://github.com/huggingface/transformers.git@main"
pip install "torch>=2.7" --index-url https://download.pytorch.org/whl/cu128  # or another CUDA build available for your torch version, e.g. cu126
pip install "auto-round>=0.12" accelerate torchvision

# Optional but recommended (10x faster GDN linear attention):
pip install flash-linear-attention causal-conv1d einops

The auto-round package is required — it installs the quantized-linear kernels that the loader needs to dequantize INT4 weights on the fly.
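
A quick way to confirm the environment before loading the model is to check that the modules can be found. This is a small convenience sketch, not part of the release; the import names below (for example `auto_round` for auto-round and `fla` for flash-linear-attention) are the ones we expect those packages to expose.

```python
import importlib.util

# Import names (not pip names) for the packages listed above.
REQUIRED = ("torch", "torchvision", "transformers", "accelerate", "auto_round")
OPTIONAL = ("fla", "causal_conv1d", "einops")  # flash-linear-attention imports as `fla`

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```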

Usage

Text only (load as CausalLM)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,              # MUST be bfloat16 (GDN overflows fp16)
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)

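messages = [{"role": "user", "content": "Write a Python function to sort a list of dicts by key."}]
# return_dict=True returns input_ids and attention_mask together, so generate() receives both.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))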

Vision-language (load as ConditionalGeneration)

The model architecture is Qwen3_5MoeForConditionalGeneration (language model + vision encoder). The vision tower is kept at BF16; only the language model was quantized. Load with AutoModelForImageTextToText:

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

vLLM (recommended for serving)

Tested with vLLM ≥ 0.19.1 (vllm/vllm-openai:latest, image digest sha256:2622f38a… pulled 2026-04-19). The image registers Qwen3_5MoeForConditionalGeneration natively and consumes the auto_round quantization config — no custom patches, weight rewrites, or runtime hacks required.

Serve

docker run --gpus all --rm -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16 \
        --tensor-parallel-size 1 \
        --max-model-len 32768 \
        --quantization auto_round \
        --dtype bfloat16 \
        --trust-remote-code \
        --reasoning-parser qwen3 \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder

Startup: weights load in 5 s, engine init (profile + KV cache + warmup) takes ~2 min. Peak VRAM on load: **15 GiB** at 32K context.

Flag reference

| Flag | Why it's needed |
| --- | --- |
| `--quantization auto_round` | INT4 dequant path. Reads per-layer `extra_config` from the checkpoint. |
| `--dtype bfloat16` | Required. GDN linear attention overflows fp16 → NaN. |
| `--trust-remote-code` | qwen3_5_moe uses custom modeling code. |
| `--reasoning-parser qwen3` | Routes `<think>…</think>` into `choices[0].message.reasoning` instead of `.content`. |
| `--enable-auto-tool-choice` + `--tool-call-parser qwen3_coder` | Extracts structured `tool_calls` from Qwen's native tool-call format. |
| `--max-model-len 32768` | Safe default for a 24 GB card. Model supports up to 262,144 on ≥ 80 GB hardware. |
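
Conceptually, the reasoning parser separates the thinking block from the final answer. The toy function below shows that split on a raw string; it is not vLLM's implementation, and `split_reasoning` is a hypothetical name. It can be handy when inspecting raw output from a server started without `--reasoning-parser`.

```python
def split_reasoning(text: str):
    """Split raw model output into (reasoning, content).

    Returns (None, text) when no closing </think> tag is present.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return None, text.strip()
    # The opening tag may be absent if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()

raw = "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51</think>17 * 23 = 391"
reasoning, content = split_reasoning(raw)
print("reasoning:", reasoning)
print("content:", content)
```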

Client usage (OpenAI-compatible API)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Text + chain-of-thought
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=1024,
)
print(r.choices[0].message.reasoning)   # the <think> block
print(r.choices[0].message.content)     # the final answer

Vision (base64 data URL or remote HTTPS URL):

import base64
with open("my_image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What is in this picture?"},
        ],
    }],
    max_tokens=512,
)
print(r.choices[0].message.content)
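
The example above hard-codes `image/jpeg`. For mixed image types, a small helper can infer the MIME type from the file extension. This is a convenience sketch using only the standard library; `image_to_data_url` is a hypothetical name, not part of any client library.

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL for an `image_url` content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Usage with the request above (path is a placeholder):
# {"type": "image_url", "image_url": {"url": image_to_data_url("my_image.png")}}
```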

Tool calling (structured extraction via qwen3_coder parser):

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
)
tc = r.choices[0].message.tool_calls[0]
print(tc.function.name, tc.function.arguments)
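
In the OpenAI-compatible API, `tc.function.arguments` arrives as a JSON string, so the caller still has to parse it and run the matching function. The sketch below shows one way to do that dispatch; the `get_weather` body and the `TOOL_REGISTRY` and `run_tool_call` names are hypothetical stand-ins for illustration.

```python
import json

def get_weather(location: str) -> str:
    # Hypothetical stand-in; a real implementation would query a weather service.
    return f"Weather lookup for {location} is not implemented in this sketch."

# Maps tool names advertised in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments: str) -> str:
    """Execute one tool call; `arguments` is the JSON string from tc.function.arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    kwargs = json.loads(arguments)
    return TOOL_REGISTRY[name](**kwargs)

# With a live server this would be: run_tool_call(tc.function.name, tc.function.arguments)
print(run_tool_call("get_weather", '{"location": "Tokyo"}'))
```

The returned string would then be sent back in a follow-up request as a message with role `tool` and the matching `tool_call_id`, so the model can produce its final answer.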

Verified end-to-end via the vLLM OpenAI API

| Test | Result |
| --- | --- |
| Math reasoning | `17 × 23 → 391` with CoT in `message.reasoning` |
| Code generation | Emits `def is_palindrome(s: str) -> bool: return s == s[::-1]` |
| Vision (simple) | Correctly identifies "red circle, black outline, white background" |
| Vision (compound) | Correctly counts 3 colored squares + names red/green/blue + left/middle/right |
| Tool calling | Structured `tool_calls[0] = get_weather({"location": "Tokyo"})` |

Other inference stacks

SGLang: Track upstream support; SGLang uses its own architecture registry, which may lag.
llama.cpp / GGUF: Not directly loadable (GGUF is a separate quantization format). You can re-quantize the BF16 pruned sibling atbender/Qwen3.6-VL-REAP-26B-A3B to GGUF with your own pipeline.
MLX (Apple Silicon): The BF16 pruned sibling is a better starting point — MLX ships its own quantizer.

Hardware

Built on: Single NVIDIA RTX Pro 6000 (Blackwell, 96 GB VRAM). End-to-end ~15h (13.4h pruning + 1.2h quantization + postprocessing).
Runs on: Any 24 GB+ GPU (RTX 3090/4090/5090) at 32K context — peak ~15 GiB VRAM on load, leaving room for KV cache. 48 GB+ (A6000, Pro 6000, H100) for full 262K context.

Limitations

Vision encoder is preserved structurally but was not re-calibrated post-prune. VL quality may differ from the base model.
Calibration was English + code-heavy. Long-tail languages may degrade more than coding tasks.
25% expert pruning removes the least-used experts; rare-domain performance may be affected.
No post-prune fine-tuning — this is a pure compression artifact.

Recipe

Full reproduction pipeline: reap_prune.py, quantize_autoround.py, postprocess.py. See reap_metadata.json for exact calibration stats and per-layer expert selections.

Credits

Cerebras Research — REAP method
Intel — AutoRound quantization
Qwen Team (Alibaba) — Base model
OpenMOSE & 0xSero — Prior art on the REAP+quantization recipe

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
