---
base_model: keithnull/Qwen3.6-35B-A3B-REAM-192
base_model_relation: quantized
library_name: gguf
license: apache-2.0
tags:
  - gguf
  - llama.cpp
  - quantized
  - qwen3
  - qwen3.6
  - ream
  - moe
  - expert-merging
pipeline_tag: text-generation
---

# Qwen3.6-35B-A3B-REAM-192 — GGUF quants

GGUF quantizations of [`keithnull/Qwen3.6-35B-A3B-REAM-192`](https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192) — the REAM expert-merged variant of `Qwen/Qwen3.6-35B-A3B` (256 → 192 routed experts, 35.11B → 27.05B params, −23%).


## Preliminary HumanEval comparison

> ⚠️ **Read the caveats below before drawing conclusions.** This is a *relative* quant-quality reference, not an absolute capability benchmark. Numbers do not directly compare to published Qwen3.6-35B-A3B HumanEval scores (which use chat template + thinking enabled).

### Results (HumanEval pass@1, raw completion, greedy, thinking off)

| Model | Compression | Disk | pass@1 | ± SE | Notes |
|---|---|---:|---:|---:|---|
| **Qwen3.6-35B-A3B-REAM-192 bf16** (reference) | 256→192 experts (REAM merge) | ~52 GB | **0.7134** | 0.0354 | Our bf16 source; reference for quant degradation |
| **Qwen3.6-35B-A3B-REAM-192 Q4_K_M** *(this repo)* | + Q4_K_M quant | **16 GB** | **0.6768** | 0.0366 | Daily-driver candidate |
| **Qwen3.6-35B-A3B-REAM-192 Q3_K_S** *(this repo)* | + Q3_K_S quant | **11 GB** | **0.6768** | 0.0366 | Same pass@1 as Q4_K_M; ~30% smaller |
| **REAP-26B Q4_K_M** ([atbender prune](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B), same 256→192 ratio) | 256→192 experts (REAP prune) | ~15 GB | **0.6646** | 0.0370 | Direct merge-vs-prune comparison at same compression |
| **Bartowski Q4_K_M**  | None + Q4_K_M | ~17 GB | **0.6463** | 0.0374 | Reference for "vanilla quant of full unmerged model" |
| Unsloth Qwen3.6-35B-A3B UD-Q4_K_M  | None + UD-Q4 | ~21 GB | 0.6280 | 0.0379 | Unsloth's dynamic-quant variant of the full unmerged model |

Raw lm-eval output (results JSON + run logs) for all six runs is available in eval-results-2026-05-05/.

### Headline reads

1. **Q3_K_S = Q4_K_M on this benchmark.** Both passed exactly 111/164 problems. At ~30% smaller disk footprint (11 GB vs 16 GB), Q3_K_S is a viable memory-constrained alternative for code work.
3. **Calibration data likely did real work.** Both REAM and REAP used [atbender's code-heavy recipe](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B) (SWE-smith, xLAM, evol-codealpaca, mix-of-thoughts). Plausible that this concentrated code-relevant capability into the kept 192 experts, partially offsetting the parameter-count loss. A non-code-calibrated REAM/REAP variant would likely score lower on HumanEval but better elsewhere.
4. **REAM Q4_K_M reaches the unmerged-base's quality floor at notably smaller disk** — 16 GB vs 17 GB (Bartowski) or 21 GB (Unsloth UD). For memory-constrained users, REAM Q4 (or even Q3) is a quality-preserving size win.

**Caveats**: confidence intervals overlap substantially (REAM 64.0-71.3% vs REAP 62.8-70.2% vs Bartowski 60.9-68.4%), so all of the above is *directional* on this single benchmark. Ranking could shift on broader code benchmarks or with multi-seed sampling.
3. **bf16 → Q4_K_M loses ~3.7 pp**; **Q4_K_M → Q3_K_S loses 0 pp** on this benchmark.

### Methodology (what's being measured)

- **Task**: lm-eval-harness `humaneval` (164 problems, OpenAI 2021)
- **Decoding**: greedy (`temperature=0`), `max_gen_toks=1024`
- **Mode**: **raw completion** (no `--apply_chat_template`), thinking off
- **Stop sequences**: lm-eval default (`'\nclass', '\ndef', '\n#', '\nif', '\nprint'`)
- **Inference engines**: vLLM 0.6+ for bf16, upstream `llama-server` (b9020-era, CUDA build) for GGUFs
- **Tokenizer**: shared (`/workspace/REAM-192-bf16` HF dir) across all GGUF runs to ensure consistent tokenization client-side
- **Hardware**: RunPod A100 SXM (80 GB VRAM), one model at a time

### Why these numbers don't match published Qwen3.6 HumanEval (~85%)

- **No chat template applied.** Qwen3.6 is instruct-tuned and benefits substantially from its chat formatting (typically +5-10 pp). We dropped `--apply_chat_template` because lm-eval 0.4.x's `local-completions` adapter has a known bug with chat-formatted `generate_until` requests (sends malformed `prompt` array → server returns 400). A chat-template-aware variant of this eval is on the follow-up list.
- **Thinking off.** Qwen's headline 85% number assumes CoT/thinking enabled. Disabling thinking for this eval costs ~10-15 pp.
- **Default lm-eval task config.** The `humaneval` task's stop sequences and prompt format are designed for *base models doing raw code completion*. Instruct models like Qwen3.6 lose points by emitting natural-language preamble before the function body that doesn't match these stops.

### Caveats — don't over-claim from this single benchmark

1. **One narrow code metric.** HumanEval pass@1 is one of many code-quality benchmarks. Broader coverage (BigCodeBench, LiveCodeBench, real agentic tasks like Aider) may show different patterns.
2. **Small sample size.** 164 problems means standard error is ±3.5-3.8 pp. Some inter-model gaps are barely outside the confidence-interval overlap.
3. **Single-run, no multi-seed variance.** With greedy decoding the run is deterministic, but minor prompt-formatting differences (max length, BOS handling, stop sequences) between engines can introduce noise we haven't measured.
4. **Methodology biases against published numbers.** This eval doesn't measure "how good is REAM in real chat usage" — it measures "did the GGUF quants preserve the bf16's behavior in raw-completion mode." Different question, different answer.
5. **Untested in other tasks.** No MMLU yet (deferred — needs `num_concurrent=8` to be tractable on lm-eval's `local-completions`). No GSM8K (CoT generation slow). No comparison vs unmerged Qwen3.6 BF16 (separate, larger eval).

### What this is good for

- **Choosing between REAM Q4_K_M and Q3_K_S** for your own setup: they're equivalent on the HumanEval pass@1 metric, so pick by disk/memory headroom — *but with a caveat from local agentic-bench observations*. On our `lil-quick` agent-loop bench (2 standalone code-gen tasks against `lil`/`pi` with the daily-driver `llama-server` stack), **Q3_K_S took ~26% longer than Q4_K_M with thinking enabled (~1810s vs ~1440s total) and produced visibly less thorough project scaffolding** — Q4_K_M tended to set up a full TypeScript project (package.json + tsconfig + vitest config + impl + tests, 6 files), while Q3_K_S more often produced just impl + tests (2-4 files) and skipped the surrounding infra. The code itself was usually correct in both cases, but Q3 used more thinking iterations to arrive there. So:
  - **Q4_K_M** if you can spare the 5 GB and want fuller, faster agent loops.
  - **Q3_K_S** if you're memory-constrained and don't mind slightly tighter outputs / longer think loops on agentic tasks. Single-shot completion (HumanEval-style "fill in the function") is unaffected.

- **Sanity-checking that the REAM merge didn't damage code reasoning** at a level a code agent would notice: it didn't.
- **Comparing quant pipelines** (REAM merge → vanilla Q4_K_M vs unmerged base → Unsloth UD-Q4): vanilla Q4 of merged scored higher than UD-Q4 of unmerged here. Surprising; deserves more eval coverage.
- 
## Files

| File | Quant | Size | Notes |
| --- | --- | ---: | --- |
| `Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf` | Q4_K_M | 16 GB | Balanced quality/size; recommended starting point |
| `Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf` | Q3_K_S | 11 GB (3.52 BPW) | Smaller-disk fallback; ~22% of bf16 reference |

The bf16 intermediate (~50 GB) is intentionally **not** uploaded - it is recoverable any time from the source repo + `convert_hf_to_gguf.py --outtype bf16`. See "Reproducibility" below.


## Usage

### `llama-server` (recommended)

```bash
./llama-server \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    -c 65536 \
    --jinja \
    --host 127.0.0.1 --port 8080
```

OpenAI-compatible chat completions on `http://127.0.0.1:8080/v1/chat/completions`. To disable Qwen3.6's `<think>` blocks, pass `chat_template_kwargs: {"enable_thinking": false}` in the request body.

### One-shot via `llama-completion`

```bash
./llama-completion \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":false}' \
    -p "Your prompt here" \
    -n 500 --temp 0.6
```

### Vision (optional, drop-in)

This is a text-only release. The REAM-merged language model is architecturally identical (same hidden dim, same MoE layout) to the REAP-26B variant, so the existing atbender-derived mmproj from [`keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF`](https://huggingface.co/keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF) (`mmproj-REAP-26B-F16.gguf`) is a drop-in vision projector. Pair it via `--mmproj` for vision-enabled inference.

## Method

1. Source: bf16 safetensors at [`keithnull/Qwen3.6-35B-A3B-REAM-192`](https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192) (52 GB on HF; 27.05B params; vision tower + MTP layer preserved unchanged from upstream Qwen3.6-35B-A3B).
2. Convert to GGUF with `convert_hf_to_gguf.py --outtype bf16` — **never** `--outtype f16`, because Qwen3.6's GDN (Gated Delta Network) linear-attention layers overflow fp16 silently and propagate NaN through any downstream quants.
3. Quantize via `llama-quantize` to `Q4_K_M` and `Q3_K_S` using llama.cpp upstream HEAD (~2026-05-04).

### Reproducibility note

`convert_hf_to_gguf.py` from current upstream HEAD raises `NotImplementedError("BPE pre-tokenizer was not recognized")` on this model because the REAM-merged tokenizer's `chkhsh` (`1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f`) isn't yet registered. The fix is a single hash entry that maps it to `res = "qwen2"` — the Qwen2 / 3 / 3.5 / 3.6 family all share the same BPE pre-tokenization regex; only the vocab (and therefore the chkhsh) differ. The `tokenizer.json` round-trip during REAM's `save_pretrained` slightly re-orders fields, producing a new fingerprint despite the tokenizer being functionally identical to the upstream Qwen3.6-35B-A3B tokenizer.

To reproduce locally, before the convert call:

```python
import re
PATH_ = "llama.cpp/convert_hf_to_gguf.py"
HASH  = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"
src = open(PATH_).read()
if HASH not in src:
    i  = src.index("def get_vocab_base_pre(")
    j  = src.index("if chkhsh ==", i)
    ls = src.rfind("\n", 0, j) + 1
    ind = src[ls:j]
    patch = (
        f'{ind}if chkhsh == "{HASH}":\n'
        f'{ind}    # Qwen3.6 (REAM-merged) — Qwen2-family BPE pre-tokenization\n'
        f'{ind}    res = "qwen2"\n'
    )
    open(PATH_, "w").write(src[:ls] + patch + src[ls:])
```