---
base_model: keithnull/Qwen3.6-35B-A3B-REAM-192
base_model_relation: quantized
library_name: gguf
license: apache-2.0
tags:
- gguf
- llama.cpp
- quantized
- qwen3
- qwen3.6
- ream
- moe
- expert-merging
pipeline_tag: text-generation
---

# Qwen3.6-35B-A3B-REAM-192: GGUF quants

GGUF quantizations of [`keithnull/Qwen3.6-35B-A3B-REAM-192`](https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192), the REAM expert-merged variant of `Qwen/Qwen3.6-35B-A3B` (256 → 192 routed experts, 35.11B → 27.05B params, −23%).

## Preliminary HumanEval comparison

> ⚠️ **Read the caveats below before drawing conclusions.** This is a *relative* quant-quality reference, not an absolute capability benchmark. Numbers do not directly compare to published Qwen3.6-35B-A3B HumanEval scores (which use chat template + thinking enabled).

### Results (HumanEval pass@1, raw completion, greedy, thinking off)

| Model | Compression | Disk | pass@1 | ± SE | Notes |
|---|---|---:|---:|---:|---|
| **Qwen3.6-35B-A3B-REAM-192 bf16** (reference) | 256→192 experts (REAM merge) | ~52 GB | **0.7134** | 0.0354 | Our bf16 source; reference for quant degradation |
| **Qwen3.6-35B-A3B-REAM-192 Q4_K_M** *(this repo)* | + Q4_K_M quant | **16 GB** | **0.6768** | 0.0366 | Daily-driver candidate |
| **Qwen3.6-35B-A3B-REAM-192 Q3_K_S** *(this repo)* | + Q3_K_S quant | **11 GB** | **0.6768** | 0.0366 | Same pass@1 as Q4_K_M; ~30% smaller |
| **REAP-26B Q4_K_M** ([atbender prune](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B), same 256→192 ratio) | 256→192 experts (REAP prune) | ~15 GB | **0.6646** | 0.0370 | Direct merge-vs-prune comparison at same compression |
| **Bartowski Q4_K_M** | None + Q4_K_M | ~17 GB | **0.6463** | 0.0374 | Reference for "vanilla quant of full unmerged model" |
| Unsloth Qwen3.6-35B-A3B UD-Q4_K_M | None + UD-Q4 | ~21 GB | 0.6280 | 0.0379 | Unsloth's dynamic-quant variant of the full unmerged model |

Raw lm-eval output (results JSON + run logs) for all six runs is available in
`eval-results-2026-05-05/`.

### Headline reads

1. **Q3_K_S = Q4_K_M on this benchmark.** Both passed exactly 111/164 problems. At a ~30% smaller disk footprint (11 GB vs 16 GB), Q3_K_S is a viable memory-constrained alternative for code work.
2. **bf16 → Q4_K_M loses ~3.7 pp**; **Q4_K_M → Q3_K_S loses 0 pp** on this benchmark.
3. **Calibration data likely did real work.** Both REAM and REAP used [atbender's code-heavy recipe](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B) (SWE-smith, xLAM, evol-codealpaca, mix-of-thoughts). It is plausible that this concentrated code-relevant capability into the kept 192 experts, partially offsetting the parameter-count loss. A non-code-calibrated REAM/REAP variant would likely score lower on HumanEval but better elsewhere.
4. **REAM Q4_K_M scores at or above the Q4 quants of the unmerged base at a smaller disk footprint**: 16 GB vs 17 GB (Bartowski) or 21 GB (Unsloth UD). For memory-constrained users, REAM Q4 (or even Q3) is a quality-preserving size win.

**Caveats**: the ±1 SE intervals overlap substantially (REAM 64.0-71.3% vs REAP 62.8-70.2% vs Bartowski 60.9-68.4%), so all of the above is *directional* on this single benchmark. Ranking could shift on broader code benchmarks or with multi-seed sampling.
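
The SE column and the ±1 SE intervals quoted above can be rechecked by hand. The sketch below assumes each HumanEval problem is a 0/1 outcome and uses the sample standard error (n - 1 in the denominator), which appears to match the reported values. Only the 111/164 pass count is stated in the text; the other counts are inferred as `round(pass@1 * 164)` and should be treated as assumptions. The function names are illustrative and not part of lm-eval.

```python
import math

N_PROBLEMS = 164  # HumanEval size, as stated in the methodology

# Assumption: only 111/164 is stated in the text. The other pass counts are
# inferred as round(pass@1 * 164) from the results table.
PASSED = {
    "REAM-192 bf16": 117,
    "REAM-192 Q4_K_M": 111,
    "REAM-192 Q3_K_S": 111,
    "REAP-26B Q4_K_M": 109,
    "Bartowski Q4_K_M": 106,
    "Unsloth UD-Q4_K_M": 103,
}


def pass_at_1(passed, total=N_PROBLEMS):
    """Return (pass@1, standard error) for one 0/1 outcome per problem."""
    p = passed / total
    # Sample standard error of the mean of 0/1 outcomes (n - 1 denominator).
    se = math.sqrt(p * (1.0 - p) / (total - 1))
    return p, se


def gap_in_se(passed_a, passed_b, total=N_PROBLEMS):
    """Score gap between two runs in units of their combined (unpaired) SE."""
    p_a, se_a = pass_at_1(passed_a, total)
    p_b, se_b = pass_at_1(passed_b, total)
    return (p_a - p_b) / math.hypot(se_a, se_b)


for name, passed in PASSED.items():
    p, se = pass_at_1(passed)
    print(f"{name:<20} pass@1={p:.4f}  SE={se:.4f}  +/-1 SE: {p - se:.3f} to {p + se:.3f}")

print(f"bf16 vs Q4_K_M gap: {gap_in_se(117, 111):.2f} combined SE")
```

The unpaired comparison ignores that every model saw the same 164 problems. A paired test over per-problem results (available in the raw lm-eval output) would be a tighter check, but it is not attempted here.
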

### Methodology (what's being measured)

- **Task**: lm-eval-harness `humaneval` (164 problems, OpenAI 2021)
- **Decoding**: greedy (`temperature=0`), `max_gen_toks=1024`
- **Mode**: **raw completion** (no `--apply_chat_template`), thinking off
- **Stop sequences**: lm-eval default (`'\nclass', '\ndef', '\n#', '\nif', '\nprint'`)
- **Inference engines**: vLLM 0.6+ for bf16, upstream `llama-server` (b9020-era, CUDA build) for GGUFs
- **Tokenizer**: shared (`/workspace/REAM-192-bf16` HF dir) across all GGUF runs to ensure consistent tokenization client-side
- **Hardware**: RunPod A100 SXM (80 GB VRAM), one model at a time

### Why these numbers don't match published Qwen3.6 HumanEval (~85%)

- **No chat template applied.** Qwen3.6 is instruct-tuned and benefits substantially from its chat formatting (typically +5-10 pp). We dropped `--apply_chat_template` because lm-eval 0.4.x's `local-completions` adapter has a known bug with chat-formatted `generate_until` requests (sends malformed `prompt` array → server returns 400). A chat-template-aware variant of this eval is on the follow-up list.
- **Thinking off.** Qwen's headline 85% number assumes CoT/thinking enabled. Disabling thinking for this eval costs ~10-15 pp.
- **Default lm-eval task config.** The `humaneval` task's stop sequences and prompt format are designed for *base models doing raw code completion*. Instruct models like Qwen3.6 lose points by emitting natural-language preamble before the function body that doesn't match these stops.

### Caveats: don't over-claim from this single benchmark

1. **One narrow code metric.** HumanEval pass@1 is one of many code-quality benchmarks. Broader coverage (BigCodeBench, LiveCodeBench, real agentic tasks like Aider) may show different patterns.
2. **Small sample size.** 164 problems means the standard error is ±3.5-3.8 pp. Most gaps between adjacent rows of the results table are about one SE or less.
3. **Single-run, no multi-seed variance.** With greedy decoding the run is deterministic, but minor prompt-formatting differences (max length, BOS handling, stop sequences) between engines can introduce noise we haven't measured.
4. **Methodology biases against published numbers.** This eval doesn't measure "how good is REAM in real chat usage"; it measures "did the GGUF quants preserve the bf16's behavior in raw-completion mode." Different question, different answer.
5. **Untested on other tasks.** No MMLU yet (deferred; needs `num_concurrent=8` to be tractable on lm-eval's `local-completions`). No GSM8K (CoT generation is slow). No comparison vs unmerged Qwen3.6 BF16 (separate, larger eval).

### What this is good for

- **Choosing between REAM Q4_K_M and Q3_K_S** for your own setup: they're equivalent on the HumanEval pass@1 metric, so pick by disk/memory headroom, *but with a caveat from local agentic-bench observations*. On our `lil-quick` agent-loop bench (2 standalone code-gen tasks against `lil`/`pi` with the daily-driver `llama-server` stack), **Q3_K_S took ~26% longer than Q4_K_M with thinking enabled (~1810s vs ~1440s total) and produced visibly less thorough project scaffolding**. Q4_K_M tended to set up a full TypeScript project (package.json + tsconfig + vitest config + impl + tests, 6 files), while Q3_K_S more often produced just impl + tests (2-4 files) and skipped the surrounding infra. The code itself was usually correct in both cases, but Q3 used more thinking iterations to arrive there. So:
  - **Q4_K_M** if you can spare the 5 GB and want fuller, faster agent loops.
  - **Q3_K_S** if you're memory-constrained and don't mind slightly tighter outputs / longer think loops on agentic tasks. Single-shot completion (HumanEval-style "fill in the function") is unaffected.
- **Sanity-checking that the REAM merge didn't damage code reasoning** at a level a code agent would notice: it didn't.
- **Comparing quant pipelines** (REAM merge → vanilla Q4_K_M vs unmerged base → Unsloth UD-Q4): vanilla Q4 of merged scored higher than UD-Q4 of unmerged here. Surprising; deserves more eval coverage.

## Files

| File | Quant | Size | Notes |
| --- | --- | ---: | --- |
| `Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf` | Q4_K_M | 16 GB | Balanced quality/size; recommended starting point |
| `Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf` | Q3_K_S | 11 GB (3.52 BPW) | Smaller-disk fallback; ~22% of bf16 reference |

The bf16 intermediate (~50 GB) is intentionally **not** uploaded; it is recoverable any time from the source repo + `convert_hf_to_gguf.py --outtype bf16`. See "Reproducibility note" below.

## Usage

### `llama-server` (recommended)

```bash
./llama-server \
  -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
  -c 65536 \
  --jinja \
  --host 127.0.0.1 --port 8080
```

OpenAI-compatible chat completions on `http://127.0.0.1:8080/v1/chat/completions`. To disable Qwen3.6's `<think>` blocks, pass `chat_template_kwargs: {"enable_thinking": false}` in the request body.

### One-shot via `llama-completion`

```bash
./llama-completion \
  -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
  --jinja \
  --chat-template-kwargs '{"enable_thinking":false}' \
  -p "Your prompt here" \
  -n 500 --temp 0.6
```

### Vision (optional, drop-in)

This is a text-only release. The REAM-merged language model is architecturally identical (same hidden dim, same MoE layout) to the REAP-26B variant, so the existing atbender-derived mmproj from [`keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF`](https://huggingface.co/keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF) (`mmproj-REAP-26B-F16.gguf`) is a drop-in vision projector. Pair it via `--mmproj` for vision-enabled inference.

## Method

1. Source: bf16 safetensors at [`keithnull/Qwen3.6-35B-A3B-REAM-192`](https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192) (52 GB on HF; 27.05B params; vision tower + MTP layer preserved unchanged from upstream Qwen3.6-35B-A3B).
2. Convert to GGUF with `convert_hf_to_gguf.py --outtype bf16`, **never** `--outtype f16`, because Qwen3.6's GDN (Gated Delta Network) linear-attention layers overflow fp16 silently and propagate NaN through any downstream quants.
3. Quantize via `llama-quantize` to `Q4_K_M` and `Q3_K_S` using llama.cpp upstream HEAD (~2026-05-04).

### Reproducibility note

`convert_hf_to_gguf.py` from current upstream HEAD raises `NotImplementedError("BPE pre-tokenizer was not recognized")` on this model because the REAM-merged tokenizer's `chkhsh` (`1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f`) isn't yet registered. The fix is a single hash entry that maps it to `res = "qwen2"`: the Qwen2 / 3 / 3.5 / 3.6 family all share the same BPE pre-tokenization regex, and only the vocab (and therefore the chkhsh) differs. The `tokenizer.json` round-trip during REAM's `save_pretrained` slightly re-orders fields, producing a new fingerprint despite the tokenizer being functionally identical to the upstream Qwen3.6-35B-A3B tokenizer.

To reproduce locally, run this before the convert call:

```python
from pathlib import Path

PATH_ = Path("llama.cpp/convert_hf_to_gguf.py")
HASH = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"

src = PATH_.read_text(encoding="utf-8")
if HASH not in src:  # skip if the entry is already registered
    # Locate the first existing `if chkhsh ==` branch in get_vocab_base_pre().
    i = src.index("def get_vocab_base_pre(")
    j = src.index("if chkhsh ==", i)
    ls = src.rfind("\n", 0, j) + 1  # start of that branch's line
    ind = src[ls:j]  # its leading whitespace, reused for the new branch
    patch = (
        f'{ind}if chkhsh == "{HASH}":\n'
        f'{ind}    # Qwen3.6 (REAM-merged): Qwen2-family BPE pre-tokenization\n'
        f'{ind}    res = "qwen2"\n'
    )
    # Insert the new branch directly ahead of the first existing one.
    PATH_.write_text(src[:ls] + patch + src[ls:], encoding="utf-8")
```
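
To sanity-check the patch logic without touching a real checkout, the same insertion can be written as a pure function and run against a small stand-in string. This is only a sketch: `add_hash_entry` and `SAMPLE` are hypothetical names, and `SAMPLE` is a synthetic snippet, not the actual contents of `convert_hf_to_gguf.py`.

```python
HASH = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"


def add_hash_entry(src, chkhsh=HASH, res="qwen2"):
    """Return src with a chkhsh branch inserted ahead of the first existing one."""
    if chkhsh in src:
        return src  # already patched, so the operation stays idempotent
    start = src.index("def get_vocab_base_pre(")
    first_if = src.index("if chkhsh ==", start)
    line_start = src.rfind("\n", 0, first_if) + 1
    indent = src[line_start:first_if]  # reuse the existing branch's indentation
    branch = (
        f'{indent}if chkhsh == "{chkhsh}":\n'
        f'{indent}    res = "{res}"\n'
    )
    return src[:line_start] + branch + src[line_start:]


# Synthetic stand-in for the converter script (assumption, not the real file).
SAMPLE = (
    "class Model:\n"
    "    def get_vocab_base_pre(self, tokenizer):\n"
    "        res = None\n"
    '        if chkhsh == "0000":\n'
    '            res = "example"\n'
    "        return res\n"
)

patched = add_hash_entry(SAMPLE)
compile(patched, "<patched>", "exec")  # raises SyntaxError if the indentation broke
print(patched)
```

Running the function twice on the same text should leave it unchanged, which makes it safe to call from a build script that may execute more than once.
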