Qwen3.6-VL-REAP-26B-A3B-W4A16

First REAP+W4A16 release of Qwen/Qwen3.6-35B-A3B, retaining the full vision-language encoder, optimized for agentic coding and tool use on 24 GB consumer GPUs.

35B → 27B (REAP 25% prune) → W4A16 (INT4) | ~3B active per token

For the BF16 pruned checkpoint (pre-quantization), see atbender/Qwen3.6-VL-REAP-26B-A3B.

Why This Exists

Qwen3.6-35B-A3B is the best open MoE for agentic coding, but at 67 GB in BF16 it doesn't fit on a single consumer GPU. Plain quantization (AWQ/GPTQ) gets it down to ~18 GB but doesn't reduce the routing overhead of 256 experts per layer. REAP prunes the least-activated 25% of experts before quantizing, giving a smaller, faster model with minimal quality loss on the target workload (coding + tool use).

There are 100+ quantized versions of Qwen3.6-35B-A3B on HF. This is the only one that combines expert pruning + quantization + vision preservation.
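
To make the pruning step concrete, here is a minimal sketch of router-weighted expert saliency as we understand it from the REAP paper: each expert is scored by the average of (gate weight × expert output norm) over the tokens routed to it, and the lowest-scoring fraction is dropped. The function names and the toy routing records are illustrative assumptions, not the code in reap_prune.py.

```python
from collections import defaultdict

def expert_saliency(routing_records):
    """Average gate_weight * output_norm per expert over the tokens routed to it.

    routing_records: iterable of (expert_id, gate_weight, output_norm) tuples,
    one per (token, selected expert) pair observed during calibration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output_norm in routing_records:
        totals[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}

def experts_to_keep(saliency, num_experts, prune_fraction):
    """Keep the highest-saliency experts; experts never routed to score 0."""
    num_keep = num_experts - int(num_experts * prune_fraction)
    ranked = sorted(range(num_experts),
                    key=lambda e: saliency.get(e, 0.0), reverse=True)
    return sorted(ranked[:num_keep])

# Toy layer with 4 experts, pruning 25% (the real model goes 256 -> 192 per layer).
records = [(0, 0.6, 2.0), (1, 0.4, 1.0), (0, 0.5, 2.0), (2, 0.5, 3.0), (3, 0.1, 0.5)]
scores = expert_saliency(records)
kept = experts_to_keep(scores, num_experts=4, prune_fraction=0.25)
print(kept)
```

In the toy data, expert 3 has both a small gate weight and a small output norm, so it is the one removed. The ranking is only as good as the calibration data that produced the routing records.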

Model Specifications

Property | Original | REAP Pruned | This Model (W4A16)
Total Parameters | ~35B | ~27B | ~27B
Active Parameters | ~3B | ~3B | ~3B
Experts per Layer | 256 | 192 | 192
Routed per Token | 8 | 8 | 8
Shared Expert | 1/layer | 1/layer | 1/layer
Layers | 40 | 40 | 40
Vision Encoder | Yes | Yes | Yes (BF16, unquantized)
Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations)
Context | 262K | 262K | 262K
VRAM (32K ctx) | ~67 GB | ~50 GB | ~14 GB
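
As a rough sanity check on the sizes above, the sketch below estimates weights-only footprints from parameter count and bits per weight. It is a back-of-the-envelope estimate, not a measurement: it treats every parameter as quantized (the vision encoder and routers are in fact kept at higher precision), ignores KV cache and activations, and assumes one 16-bit scale per group of 128 weights with no zero point (symmetric quantization). The function name is hypothetical.

```python
def weight_footprint_gb(num_params, weight_bits, group_size=None, scale_bits=16):
    """Weights-only size in GB (1e9 bytes).

    For group-wise symmetric quantization, each group of `group_size` weights
    shares one scale of `scale_bits` bits and needs no zero point.
    """
    bits_per_weight = weight_bits
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9

original_bf16 = weight_footprint_gb(35e9, 16)
pruned_bf16 = weight_footprint_gb(27e9, 16)
pruned_w4a16 = weight_footprint_gb(27e9, 4, group_size=128)
print(f"{original_bf16:.1f} {pruned_bf16:.1f} {pruned_w4a16:.1f}")
```

The measured figures in the table differ from these estimates because parameter counts are rounded and real checkpoints mix precisions, so treat the output as an order-of-magnitude guide only.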

Calibration Dataset

Expert pruning quality depends heavily on what data the model sees during saliency measurement. We built a composite calibration set of 24,576 samples at seqlen 16,384 tilted toward Qwen3.6's headline capability: agentic coding and tool use.

Source | Samples | Why
SWE-bench/SWE-smith-trajectories (tool split) | 6,144 | Agentic multi-turn. Full SWE-bench trajectories with tool calls, file edits, and test runs. This is the closest proxy to real-world agentic coding, the primary use case we're optimizing for.
Salesforce/xlam-function-calling-60k | 6,144 | Single-turn tool calling. Structured function definitions + invocations. Ensures the experts responsible for tool-use formatting survive pruning.
theblackcat102/evol-codealpaca-v1 | 4,096 | General coding. Evolved instruction-following across languages and difficulty levels. Breadth coverage so we don't over-specialize on agentic patterns.
open-r1/Mixture-of-Thoughts (code) | 2,730 | Code reasoning. Long chain-of-thought traces for programming problems. Preserves the model's ability to reason step-by-step through code.
open-r1/Mixture-of-Thoughts (math) | 2,730 | Math reasoning. Ensures pruning doesn't disproportionately kill the experts activated during mathematical reasoning, a known risk with code-only calibration.
open-r1/Mixture-of-Thoughts (science) | 2,732 | Science reasoning. Same rationale as math: broader domain coverage keeps the model general-purpose even though the primary target is coding.

Design rationale: 50% of samples (12,288) are agentic or tool-calling data. The other 50% are general coding (4,096) and reasoning across code, math, and science (8,192). This split is intentional: at 25% expert pruning, we're only removing the tail of the saliency distribution. The diverse calibration ensures we're measuring true "least-useful" experts, not just "least-useful for code." Prior work at 50% pruning with pile-10k calibration showed degradation on tool-use tasks specifically because the calibration didn't exercise tool-calling experts enough.
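
The sample counts in the table can be tallied to check the total and the share of each kind of data. The category labels below are our own grouping for illustration, and the source names are shortened.

```python
calibration_mix = {
    "SWE-smith-trajectories (tool split)": ("agentic/tool use", 6144),
    "xlam-function-calling-60k": ("agentic/tool use", 6144),
    "evol-codealpaca-v1": ("general coding", 4096),
    "Mixture-of-Thoughts (code)": ("reasoning", 2730),
    "Mixture-of-Thoughts (math)": ("reasoning", 2730),
    "Mixture-of-Thoughts (science)": ("reasoning", 2732),
}

def share_by_category(mix):
    """Return (total samples, {category: fraction of total})."""
    total = sum(n for _, n in mix.values())
    per_category = {}
    for category, n in mix.values():
        per_category[category] = per_category.get(category, 0) + n
    return total, {c: n / total for c, n in per_category.items()}

total, shares = share_by_category(calibration_mix)
print(total)
for category, share in shares.items():
    print(f"{category}: {share:.1%}")
```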

Quantization Details

Property | Value
Method | AutoRound (Intel)
Bits | 4 (INT4 weights, BF16 activations)
Group size | 128
Symmetric | Yes (required for Marlin/vLLM fast kernels)
Calibration | github-code-clean (AutoRound built-in, code-focused), 512 samples, seqlen 2048
Routers (mlp.gate) | FP16, never quantized
shared_expert_gate | FP16, never quantized
Vision encoder | BF16, never quantized

Why github-code-clean for quantization calibration? We align the AutoRound calibration with the REAP calibration's code-heavy emphasis. AutoRound ships a curated code-focused calibration set (github-code-clean) that's natively tokenized for the framework. 512 samples at seqlen 2048 give ~1M calibration tokens. That is 8x more samples than our prior pile-10k@64x512 approach and, reading 64x512 as 64 samples of 512 tokens, 32x more tokens.
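
The token budgets mentioned in this card follow from samples × sequence length. The short calculation below assumes "64x512" means 64 samples of 512 tokens, and treats the REAP figure as an upper bound because samples shorter than the sequence length contribute fewer tokens.

```python
def calibration_tokens(num_samples, seqlen):
    """Maximum number of calibration tokens for a given sample count and length."""
    return num_samples * seqlen

autoround_tokens = calibration_tokens(512, 2048)
prior_tokens = calibration_tokens(64, 512)        # assumption: 64 samples x 512 tokens
reap_upper_bound = calibration_tokens(24_576, 16_384)

print(autoround_tokens, prior_tokens, autoround_tokens // prior_tokens)
print(reap_upper_bound)
```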

What's Different from atbender's Qwen3.5 Work

  • Full pipeline from scratch — REAP observe → prune → AutoRound quantize → postprocess. Not downstream of any pre-pruned checkpoint.
  • 25% prune (not 50%) — 256 → 192 experts. Much safer quality retention at this model scale.
  • Agentic-coding calibration — composite dataset tilted toward tool use (see above).
  • Stronger quantization calibration — 512 samples at seqlen 2048 on github-code-clean (vs pile-10k @ 64x512).
  • Vision encoder preserved — copied from base model, unquantized BF16.

Important: dtype Must Be bfloat16

The GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504). Always use dtype=torch.bfloat16. float16 produces NaN outputs.
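
The range difference behind this requirement can be shown without a GPU. The sketch below uses only the Python standard library: struct's half-precision format refuses values beyond the float16 range, while a bfloat16 emulation (keeping the top 16 bits of the float32 encoding) stays finite. The emulation truncates rather than rounds, which real hardware may not do, and the helper names are hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite float16 value

def fits_in_float16(x):
    """True if x can be packed as IEEE 754 half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

def to_bfloat16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top16 = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top16 + b"\x00\x00")[0]

print(fits_in_float16(FP16_MAX))   # at the edge of the float16 range
print(fits_in_float16(70000.0))    # beyond the float16 range
print(to_bfloat16(70000.0))        # finite in bfloat16, at reduced precision
```

bfloat16 keeps the 8-bit exponent of float32 and gives up mantissa bits instead, so large intermediate values lose precision but do not overflow.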

Prerequisites

pip install "git+https://github.com/huggingface/transformers.git@main"
pip install "torch>=2.7" --index-url https://download.pytorch.org/whl/cu128  # or cu126
pip install "auto-round>=0.12" accelerate torchvision

# Optional but recommended (10x faster GDN linear attention):
pip install flash-linear-attention causal-conv1d einops

The auto-round package is required — it installs the quantized-linear kernels that the loader needs to dequantize INT4 weights on the fly.
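
Before loading the model, it can help to confirm that the packages above are importable. The check below is a small sketch using only the standard library. The import names are assumptions based on how these packages are usually imported (auto-round as auto_round, flash-linear-attention as fla, causal-conv1d as causal_conv1d).

```python
import importlib.util

REQUIRED = ["torch", "transformers", "auto_round", "accelerate", "torchvision"]
OPTIONAL = ["fla", "causal_conv1d", "einops"]  # assumed import names, see above

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)
print("missing required:", missing_required or "none")
print("missing optional:", missing_optional or "none")
```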

Usage

Text only (load as CausalLM)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,              # MUST be bfloat16 (GDN overflows fp16)
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to sort a list of dicts by key."}]
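# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))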

Vision-language (load as ConditionalGeneration)

The model architecture is Qwen3_5MoeForConditionalGeneration (language model + vision encoder). The vision tower is kept at BF16; only the language model was quantized. Load with AutoModelForImageTextToText:

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

vLLM (recommended for serving)

Tested with vLLM ≥ 0.19.1 (vllm/vllm-openai:latest, image digest sha256:2622f38a… pulled 2026-04-19). The image registers Qwen3_5MoeForConditionalGeneration natively and consumes the auto_round quantization config — no custom patches, weight rewrites, or runtime hacks required.

Serve

docker run --gpus all --rm -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16 \
        --tensor-parallel-size 1 \
        --max-model-len 32768 \
        --quantization auto_round \
        --dtype bfloat16 \
        --trust-remote-code \
        --reasoning-parser qwen3 \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder

Startup: weights load in 5 s, engine init (profile + KV cache + warmup) takes ~2 min. Peak VRAM on load: 15 GiB at 32K context.

Flag reference

Flag | Why it's needed
--quantization auto_round | INT4 dequant path. Reads per-layer extra_config from the checkpoint.
--dtype bfloat16 | Required. GDN linear attention overflows fp16 → NaN.
--trust-remote-code | qwen3_5_moe uses custom modeling code.
--reasoning-parser qwen3 | Routes <think>…</think> into choices[0].message.reasoning instead of .content.
--enable-auto-tool-choice + --tool-call-parser qwen3_coder | Extracts structured tool_calls from Qwen's native tool-call format.
--max-model-len 32768 | Safe default for a 24 GB card. Model supports up to 262,144 on ≥ 80 GB hardware.

Client usage (OpenAI-compatible API)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Text + chain-of-thought
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=1024,
)
print(r.choices[0].message.reasoning)   # the <think> block
print(r.choices[0].message.content)     # the final answer
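
If the server is started without --reasoning-parser, the <think> block arrives inside .content instead. The fallback below is a client-side sketch of the same split. It is not vLLM's parser, the function name is hypothetical, and it only handles output that contains a complete <think>…</think> pair.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) from raw model output containing a <think> block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

raw = "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51</think>\n391"
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```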

Vision (base64 data URL or remote HTTPS URL):

import base64
with open("my_image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What is in this picture?"},
        ],
    }],
    max_tokens=512,
)
print(r.choices[0].message.content)
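
The request above hardcodes image/jpeg. For mixed image types, a small helper can pick the MIME type from the file extension. This is a convenience sketch using only the standard library, and image_data_url is a hypothetical name.

```python
import base64
import mimetypes

def image_data_url(filename, data):
    """Build a base64 data URL, guessing the MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Only the first bytes of a PNG header are used here to keep the output short.
url = image_data_url("chart.png", b"\x89PNG")
print(url)
```

The returned string can be passed as the "url" value of an image_url content part.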

Tool calling (structured extraction via qwen3_coder parser):

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
)
tc = r.choices[0].message.tool_calls[0]
print(tc.function.name, tc.function.arguments)
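
The example above stops once the model has requested a tool. To complete the round trip, the client runs the tool and sends the result back as a tool message. The sketch below shows that dispatch step in isolation. The get_weather body is a stand-in, and run_tool_call and TOOL_REGISTRY are hypothetical names.

```python
import json

def get_weather(location):
    """Stand-in implementation; a real tool would query a weather service."""
    return {"location": location, "forecast": "unknown"}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(call_id, name, arguments_json, registry=TOOL_REGISTRY):
    """Execute one tool call and return the `tool` role message to send back."""
    if name not in registry:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = registry[name](**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that don't match the function signature
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

# With the response `r` from the request above, the follow-up turn would be:
#   tc = r.choices[0].message.tool_calls[0]
#   messages.append(r.choices[0].message)
#   messages.append(run_tool_call(tc.id, tc.function.name, tc.function.arguments))
#   then call client.chat.completions.create again with the extended messages
tool_message = run_tool_call("call_0", "get_weather", '{"location": "Tokyo"}')
print(tool_message)
```

Returning errors as tool content, rather than raising, lets the model see the failure and retry with corrected arguments.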

Verified end-to-end via the vLLM OpenAI API

Test | Result
Math reasoning | 17 × 23 → 391 with CoT in message.reasoning
Code generation | Emits def is_palindrome(s: str) -> bool: return s == s[::-1]
Vision (simple) | Correctly identifies "red circle, black outline, white background"
Vision (compound) | Correctly counts 3 colored squares + names red/green/blue + left/middle/right
Tool calling | Structured tool_calls[0] = get_weather({"location": "Tokyo"})

Other inference stacks

  • SGLang: Track upstream — uses its own arch registry, may lag.
  • llama.cpp / GGUF: Not directly loadable (GGUF is a separate quantization format). You can re-quantize the BF16 pruned sibling atbender/Qwen3.6-VL-REAP-26B-A3B to GGUF with your own pipeline.
  • MLX (Apple Silicon): The BF16 pruned sibling is a better starting point — MLX ships its own quantizer.

Hardware

  • Built on: Single NVIDIA RTX Pro 6000 (Blackwell, 96 GB VRAM). End-to-end ~15h (13.4h pruning + 1.2h quantization + postprocessing).
  • Runs on: Any 24 GB+ GPU (RTX 3090/4090/5090) at 32K context — peak ~15 GiB VRAM on load, leaving room for KV cache. 48 GB+ (A6000, Pro 6000, H100) for full 262K context.

Limitations

  • Vision encoder is preserved structurally but was not re-calibrated post-prune. VL quality may differ from the base model.
  • Calibration was English + code-heavy. Long-tail languages may degrade more than coding tasks.
  • 25% expert pruning removes the least-used experts; rare-domain performance may be affected.
  • No post-prune fine-tuning — this is a pure compression artifact.

Recipe

Full reproduction pipeline: reap_prune.py, quantize_autoround.py, postprocess.py. See reap_metadata.json for exact calibration stats and per-layer expert selections.

Credits

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}