Qwen3.6-35B-A3B-REAM-192 — GGUF quants

GGUF quantizations of keithnull/Qwen3.6-35B-A3B-REAM-192 — the REAM expert-merged variant of Qwen/Qwen3.6-35B-A3B (256 → 192 routed experts, 35.11B → 27.05B params, −23%).

Preliminary HumanEval comparison

⚠️ Read the caveats below before drawing conclusions. This is a relative quant-quality reference, not an absolute capability benchmark. Numbers do not directly compare to published Qwen3.6-35B-A3B HumanEval scores (which use chat template + thinking enabled).

Results (HumanEval pass@1, raw completion, greedy, thinking off)

| Model | Compression | Disk | pass@1 | ± SE | Notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-REAM-192 bf16 (reference) | 256→192 experts (REAM merge) | ~52 GB | 0.7134 | 0.0354 | Our bf16 source; reference for quant degradation |
| Qwen3.6-35B-A3B-REAM-192 Q4_K_M (this repo) | + Q4_K_M quant | 16 GB | 0.6768 | 0.0366 | Daily-driver candidate |
| Qwen3.6-35B-A3B-REAM-192 Q3_K_S (this repo) | + Q3_K_S quant | 11 GB | 0.6768 | 0.0366 | Same pass@1 as Q4_K_M; ~30% smaller |
| REAP-26B Q4_K_M (atbender prune, same 256→192 ratio) | 256→192 experts (REAP prune) | ~15 GB | 0.6646 | 0.0370 | Direct merge-vs-prune comparison at same compression |
| Bartowski Q4_K_M | None + Q4_K_M | ~17 GB | 0.6463 | 0.0374 | Reference for "vanilla quant of full unmerged model" |
| Unsloth Qwen3.6-35B-A3B UD-Q4_K_M | None + UD-Q4 | ~21 GB | 0.6280 | 0.0379 | Unsloth's dynamic-quant variant of the full unmerged model |

Raw lm-eval output (results JSON + run logs) for all six runs is available in eval-results-2026-05-05/.

Headline reads

  1. Q3_K_S = Q4_K_M on this benchmark. Both passed exactly 111/164 problems. At ~30% smaller disk footprint (11 GB vs 16 GB), Q3_K_S is a viable memory-constrained alternative for code work.
  2. Calibration data likely did real work. Both REAM and REAP used atbender's code-heavy recipe (SWE-smith, xLAM, evol-codealpaca, mix-of-thoughts). Plausible that this concentrated code-relevant capability into the kept 192 experts, partially offsetting the parameter-count loss. A non-code-calibrated REAM/REAP variant would likely score lower on HumanEval but better elsewhere.
  3. REAM Q4_K_M scores at or above the Q4 quants of the unmerged base at a smaller disk footprint: 16 GB vs 17 GB (Bartowski) or 21 GB (Unsloth UD). For memory-constrained users, REAM Q4 (or even Q3) is a size win with no measured quality loss on this benchmark.

Caveats: the ±1 SE ranges overlap substantially (REAM 64.0-71.3% vs REAP 62.8-70.2% vs Bartowski 60.9-68.4%), so all of the above is directional on this single benchmark. Ranking could shift on broader code benchmarks or with multi-seed sampling. On quantization cost: bf16 → Q4_K_M loses ~3.7 pp, while Q4_K_M → Q3_K_S loses 0 pp on this benchmark.
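The standard errors and ranges quoted here can be reproduced from the pass@1 values alone. The sketch below assumes the harness reports the sample standard error of a pass/fail outcome, sqrt(p(1-p)/(n-1)) with n = 164. That formula is an assumption, chosen because it matches the SE column in the results table; the function names are illustrative.

```python
import math

N_PROBLEMS = 164  # HumanEval problem count


def standard_error(p: float, n: int = N_PROBLEMS) -> float:
    """Sample standard error of a pass/fail mean (assumed n-1 denominator)."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


def one_se_interval(p: float, n: int = N_PROBLEMS) -> tuple[float, float]:
    """pass@1 minus/plus one standard error, in percent, rounded to 0.1."""
    se = standard_error(p, n)
    return round(100 * (p - se), 1), round(100 * (p + se), 1)


# pass@1 values copied from the results table
PASS_AT_1 = {
    "REAM bf16": 0.7134,
    "REAM Q4_K_M": 0.6768,
    "REAP Q4_K_M": 0.6646,
    "Bartowski Q4_K_M": 0.6463,
}

for name, p in PASS_AT_1.items():
    low, high = one_se_interval(p)
    print(f"{name}: SE={standard_error(p):.4f}, range={low}-{high}%")
```

Note that these are one-standard-error ranges, which are narrower than a conventional 95% interval, so the true overlap between models is larger than the ranges alone suggest.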

Methodology (what's being measured)

  • Task: lm-eval-harness humaneval (164 problems, OpenAI 2021)
  • Decoding: greedy (temperature=0), max_gen_toks=1024
  • Mode: raw completion (no --apply_chat_template), thinking off
  • Stop sequences: lm-eval default ('\nclass', '\ndef', '\n#', '\nif', '\nprint')
  • Inference engines: vLLM 0.6+ for bf16, upstream llama-server (b9020-era, CUDA build) for GGUFs
  • Tokenizer: shared (/workspace/REAM-192-bf16 HF dir) across all GGUF runs to ensure consistent tokenization client-side
  • Hardware: RunPod A100 SXM (80 GB VRAM), one model at a time

Why these numbers don't match published Qwen3.6 HumanEval (~85%)

  • No chat template applied. Qwen3.6 is instruct-tuned and benefits substantially from its chat formatting (typically +5-10 pp). We dropped --apply_chat_template because lm-eval 0.4.x's local-completions adapter has a known bug with chat-formatted generate_until requests (sends malformed prompt array → server returns 400). A chat-template-aware variant of this eval is on the follow-up list.
  • Thinking off. Qwen's headline 85% number assumes CoT/thinking enabled. Disabling thinking for this eval costs ~10-15 pp.
  • Default lm-eval task config. The humaneval task's stop sequences and prompt format are designed for base models doing raw code completion. Instruct models like Qwen3.6 lose points by emitting natural-language preamble before the function body that doesn't match these stops.
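To make the last point concrete, here is a simplified sketch of how stop-sequence truncation interacts with raw-completion scoring. It is not lm-eval's actual implementation: the helper names and the toy `add` problem are hypothetical, and the only detail taken from this eval is the list of stop sequences.

```python
# Stop sequences listed in the methodology section
STOPS = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop(completion: str, stops=STOPS) -> str:
    """Cut the completion at the earliest stop sequence, if any occurs."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


def is_valid_python(source: str) -> bool:
    """True if the source compiles; it is never executed here."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Toy HumanEval-style prompt: signature plus docstring, body left to the model
PROMPT = 'def add(a, b):\n    """Return a + b."""\n'

# A base-model-style continuation: body first, then it rambles into a new def
base_style = "    return a + b\n\ndef unrelated():\n    pass\n"
# An instruct-style continuation: natural-language preamble before the body
chat_style = "Sure! Here is the implementation:\n\n    return a + b\n"

base_candidate = PROMPT + truncate_at_stop(base_style)
chat_candidate = PROMPT + truncate_at_stop(chat_style)

print(is_valid_python(base_candidate))  # body kept, trailing def cut off
print(is_valid_python(chat_candidate))  # preamble survives and breaks the file
```

The stops only trim what comes after the function body. Nothing removes text that comes before it, so a correct solution wrapped in a conversational preamble still fails.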

Caveats — don't over-claim from this single benchmark

  1. One narrow code metric. HumanEval pass@1 is one of many code-quality benchmarks. Broader coverage (BigCodeBench, LiveCodeBench, real agentic tasks like Aider) may show different patterns.
  2. Small sample size. 164 problems means standard error is ±3.5-3.8 pp. Some inter-model gaps are barely outside the confidence-interval overlap.
  3. Single-run, no multi-seed variance. With greedy decoding the run is deterministic, but minor prompt-formatting differences (max length, BOS handling, stop sequences) between engines can introduce noise we haven't measured.
  4. Methodology biases against published numbers. This eval doesn't measure "how good is REAM in real chat usage" — it measures "did the GGUF quants preserve the bf16's behavior in raw-completion mode." Different question, different answer.
  5. Untested in other tasks. No MMLU yet (deferred — needs num_concurrent=8 to be tractable on lm-eval's local-completions). No GSM8K (CoT generation slow). No comparison vs unmerged Qwen3.6 BF16 (separate, larger eval).
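One rough way to read the sample-size caveat is to express each gap in units of the combined standard error. The sketch below treats the two runs as independent samples, which is a simplification (every model was scored on the same 164 problems, so a paired test would be more appropriate). The function name is illustrative; the inputs are copied from the results table.

```python
import math


def gap_in_se_units(p_a: float, se_a: float, p_b: float, se_b: float) -> float:
    """Difference between two pass@1 scores divided by the combined SE.

    Assumes independent samples; see the caveat in the text above.
    """
    return (p_a - p_b) / math.sqrt(se_a**2 + se_b**2)


# (pass@1, SE) pairs copied from the results table
REAM_BF16 = (0.7134, 0.0354)
REAM_Q4 = (0.6768, 0.0366)
BARTOWSKI_Q4 = (0.6463, 0.0374)
UNSLOTH_UD_Q4 = (0.6280, 0.0379)

# Gap between REAM Q4_K_M and the vanilla Q4_K_M of the unmerged model
print(f"{gap_in_se_units(*REAM_Q4, *BARTOWSKI_Q4):.2f}")
# Largest gap in the table: REAM bf16 vs Unsloth UD-Q4_K_M
print(f"{gap_in_se_units(*REAM_BF16, *UNSLOTH_UD_Q4):.2f}")
```

Both values come out below 2, the usual rough threshold for a 95% two-sided test, which is consistent with treating the rankings as directional only.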

What this is good for

  • Choosing between REAM Q4_K_M and Q3_K_S for your own setup: they are equivalent on HumanEval pass@1, so pick by disk/memory headroom, with one caveat from local agentic-bench observations. On our lil-quick agent-loop bench (2 standalone code-gen tasks against lil/pi with the daily-driver llama-server stack), Q3_K_S took about 26% longer than Q4_K_M with thinking enabled (1810s vs ~1440s total) and produced visibly less thorough project scaffolding. Q4_K_M tended to set up a full TypeScript project (package.json + tsconfig + vitest config + impl + tests, 6 files), while Q3_K_S more often produced just impl + tests (2-4 files) and skipped the surrounding infra. The code itself was usually correct in both cases, but Q3_K_S needed more thinking iterations to get there. So:

    • Q4_K_M if you can spare the 5 GB and want fuller, faster agent loops.
    • Q3_K_S if you're memory-constrained and don't mind slightly tighter outputs / longer think loops on agentic tasks. Single-shot completion (HumanEval-style "fill in the function") is unaffected.
  • Sanity-checking that the REAM merge didn't damage code reasoning at a level a code agent would notice: no such damage showed up on this benchmark.
  • Comparing quant pipelines (REAM merge → vanilla Q4_K_M vs unmerged base → Unsloth UD-Q4): vanilla Q4 of the merged model scored higher than UD-Q4 of the unmerged model here. This is surprising and deserves more eval coverage.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf | Q4_K_M | 16 GB | Balanced quality/size; recommended starting point |
| Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf | Q3_K_S | 11 GB (3.52 BPW) | Smaller-disk fallback; ~22% of bf16 reference |

The bf16 intermediate (~50 GB) is intentionally not uploaded; it can be regenerated at any time from the source repo with convert_hf_to_gguf.py --outtype bf16. See "Reproducibility note" below.

Usage

llama-server (recommended)

./llama-server \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    -c 65536 \
    --jinja \
    --host 127.0.0.1 --port 8080

OpenAI-compatible chat completions on http://127.0.0.1:8080/v1/chat/completions. To disable Qwen3.6's <think> blocks, pass chat_template_kwargs: {"enable_thinking": false} in the request body.
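A minimal client sketch for that endpoint, using only the Python standard library. The `build_payload` and `chat` helpers are illustrative names, the temperature is an arbitrary example value, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

# Endpoint exposed by the llama-server command in this section
URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt: str, enable_thinking: bool = False) -> dict:
    """Build an OpenAI-style chat request body for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the Jinja chat template (server started with --jinja)
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "temperature": 0.6,  # illustrative value
    }


def chat(prompt: str, enable_thinking: bool = False, url: str = URL) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply text. Pass `enable_thinking=True` to keep the <think> blocks.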

One-shot via llama-completion

./llama-completion \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":false}' \
    -p "Your prompt here" \
    -n 500 --temp 0.6

Vision (optional, drop-in)

This is a text-only release. The REAM-merged language model is architecturally identical (same hidden dim, same MoE layout) to the REAP-26B variant, so the existing atbender-derived mmproj from keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (mmproj-REAP-26B-F16.gguf) is a drop-in vision projector. Pair it via --mmproj for vision-enabled inference.

Method

  1. Source: bf16 safetensors at keithnull/Qwen3.6-35B-A3B-REAM-192 (52 GB on HF; 27.05B params; vision tower + MTP layer preserved unchanged from upstream Qwen3.6-35B-A3B).
  2. Convert to GGUF with convert_hf_to_gguf.py --outtype bf16, never --outtype f16, because Qwen3.6's GDN (Gated Delta Network) linear-attention layers overflow fp16 silently and propagate NaN through any downstream quants.
  3. Quantize via llama-quantize to Q4_K_M and Q3_K_S using llama.cpp upstream HEAD (~2026-05-04).
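The three steps above can be sketched as a small driver script. This is not the exact invocation used for this release: the directory layout, the intermediate file names and the helper names are hypothetical, and it assumes llama-quantize is on PATH and a llama.cpp checkout sits in the working directory. The tokenizer-hash patch described in the reproducibility note has to be applied before the convert step will run.

```python
import subprocess
from pathlib import Path


def conversion_commands(hf_dir: str, out_dir: str, quants=("Q4_K_M", "Q3_K_S")):
    """Return the convert + quantize command lines as argument lists."""
    out = Path(out_dir)
    bf16_gguf = out / "model-bf16.gguf"  # hypothetical intermediate name
    commands = [[
        "python", "llama.cpp/convert_hf_to_gguf.py", hf_dir,
        "--outfile", str(bf16_gguf),
        "--outtype", "bf16",  # bf16 rather than f16, per step 2
    ]]
    for quant in quants:
        # llama-quantize takes: input GGUF, output GGUF, quant type
        commands.append([
            "llama-quantize", str(bf16_gguf),
            str(out / f"model-{quant}.gguf"), quant,
        ])
    return commands


def run_all(commands) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before committing to a conversion of a ~50 GB intermediate.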

Reproducibility note

convert_hf_to_gguf.py from current upstream HEAD raises NotImplementedError("BPE pre-tokenizer was not recognized") on this model because the REAM-merged tokenizer's chkhsh (1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f) isn't yet registered. The fix is a single hash entry that maps it to res = "qwen2" — the Qwen2 / 3 / 3.5 / 3.6 family all share the same BPE pre-tokenization regex; only the vocab (and therefore the chkhsh) differ. The tokenizer.json round-trip during REAM's save_pretrained slightly re-orders fields, producing a new fingerprint despite the tokenizer being functionally identical to the upstream Qwen3.6-35B-A3B tokenizer.

To reproduce locally, before the convert call:

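from pathlib import Path

PATH_ = Path("llama.cpp/convert_hf_to_gguf.py")
HASH = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"

src = PATH_.read_text()
if HASH not in src:  # idempotent: skip if the hash is already registered
    # Find the first `if chkhsh ==` check inside get_vocab_base_pre()
    i = src.index("def get_vocab_base_pre(")
    j = src.index("if chkhsh ==", i)
    ls = src.rfind("\n", 0, j) + 1  # start of that line
    ind = src[ls:j]  # its leading indentation, reused for the new check
    patch = (
        f'{ind}if chkhsh == "{HASH}":\n'
        f'{ind}    # Qwen3.6 (REAM-merged) - Qwen2-family BPE pre-tokenization\n'
        f'{ind}    res = "qwen2"\n'
    )
    # Insert the new check immediately before the first existing one
    PATH_.write_text(src[:ls] + patch + src[ls:])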