Instructions to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF",
	filename="Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
# Run inference directly in the terminal:
llama-cli -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
# Run inference directly in the terminal:
llama-cli -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
# Run inference directly in the terminal:
./llama-cli -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Use Docker

docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

vLLM

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Ollama
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Ollama:
```
ollama run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
```

Unsloth Studio

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF to start chatting

Pi

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Run Hermes

hermes

Docker Model Runner
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Docker Model Runner:
```
docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S
```

Lemonade

How to use keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF:Q3_K_S

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-REAM-192-GGUF-Q3_K_S

List all available models

lemonade list

Qwen3.6-35B-A3B-REAM-192 — GGUF quants

GGUF quantizations of keithnull/Qwen3.6-35B-A3B-REAM-192 — the REAM expert-merged variant of Qwen/Qwen3.6-35B-A3B (256 → 192 routed experts, 35.11B → 27.05B params, −23%).

Preliminary HumanEval comparison

⚠️ Read the caveats below before drawing conclusions. This is a relative quant-quality reference, not an absolute capability benchmark. Numbers do not directly compare to published Qwen3.6-35B-A3B HumanEval scores (which use chat template + thinking enabled).

Results (HumanEval pass@1, raw completion, greedy, thinking off)

| Model | Compression | Disk | pass@1 | ± SE | Notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-REAM-192 bf16 (reference) | 256→192 experts (REAM merge) | ~52 GB | 0.7134 | 0.0354 | Our bf16 source; reference for quant degradation |
| Qwen3.6-35B-A3B-REAM-192 Q4_K_M (this repo) | + Q4_K_M quant | 16 GB | 0.6768 | 0.0366 | Daily-driver candidate |
| Qwen3.6-35B-A3B-REAM-192 Q3_K_S (this repo) | + Q3_K_S quant | 11 GB | 0.6768 | 0.0366 | Same pass@1 as Q4_K_M; ~30% smaller |
| REAP-26B Q4_K_M (atbender prune, same 256→192 ratio) | 256→192 experts (REAP prune) | ~15 GB | 0.6646 | 0.0370 | Direct merge-vs-prune comparison at same compression |
| Bartowski Q4_K_M | None + Q4_K_M | ~17 GB | 0.6463 | 0.0374 | Reference for "vanilla quant of full unmerged model" |
| Unsloth Qwen3.6-35B-A3B UD-Q4_K_M | None + UD-Q4 | ~21 GB | 0.6280 | 0.0379 | Unsloth's dynamic-quant variant of the full unmerged model |

Raw lm-eval output (results JSON + run logs) for all six runs is available in eval-results-2026-05-05/.

Headline reads

Q3_K_S = Q4_K_M on this benchmark. Both passed exactly 111/164 problems. At ~30% smaller disk footprint (11 GB vs 16 GB), Q3_K_S is a viable memory-constrained alternative for code work.
Calibration data likely did real work. Both REAM and REAP used atbender's code-heavy recipe (SWE-smith, xLAM, evol-codealpaca, mix-of-thoughts). Plausible that this concentrated code-relevant capability into the kept 192 experts, partially offsetting the parameter-count loss. A non-code-calibrated REAM/REAP variant would likely score lower on HumanEval but better elsewhere.
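REAM Q4_K_M scores at or above both Q4 quants of the unmerged base (0.6768 vs 0.6463 and 0.6280) at a smaller disk footprint: 16 GB vs ~17 GB (Bartowski) or ~21 GB (Unsloth UD). For memory-constrained users, REAM Q4 (or even Q3) is a size win with no measured HumanEval cost, subject to the error bars discussed in the caveats.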

Caveats: the ±1 SE intervals overlap substantially (REAM 64.0-71.3% vs REAP 62.8-70.2% vs Bartowski 60.9-68.4%), so all of the above is directional on this single benchmark. Ranking could shift on broader code benchmarks or with multi-seed sampling.

Quantization cost: bf16 → Q4_K_M loses ~3.7 pp; Q4_K_M → Q3_K_S loses 0 pp on this benchmark.
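
The ± SE column and the intervals quoted above are consistent with the sample standard error of a mean over 164 pass/fail outcomes. The sketch below reproduces them under that assumption; it is not taken from the eval logs, and the pass counts are inferred by rounding pass@1 × 164 (only the 111/164 figure is stated above).

```python
import math

N_PROBLEMS = 164  # HumanEval problem count

def pass_stats(pass_at_1, n=N_PROBLEMS):
    """Return (passed, stderr, lo, hi) for a pass@1 score over n binary outcomes."""
    passed = round(pass_at_1 * n)  # inferred number of solved problems
    p = passed / n
    # Sample standard deviation of 0/1 outcomes, divided by sqrt(n)
    stderr = math.sqrt(p * (1 - p) / (n - 1))
    return passed, stderr, p - stderr, p + stderr

# pass@1 values copied from the results table
for name, score in [("REAM Q4_K_M", 0.6768), ("REAP Q4_K_M", 0.6646), ("Bartowski Q4_K_M", 0.6463)]:
    passed, se, lo, hi = pass_stats(score)
    print(f"{name}: {passed}/{N_PROBLEMS}, SE={se:.4f}, +/-1 SE = {lo:.1%} to {hi:.1%}")
```

These are ±1 SE bands, which are narrower than a conventional 95% interval (roughly ±2 SE), so the overlap between models is larger still at that level.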

Methodology (what's being measured)

Task: lm-eval-harness humaneval (164 problems, OpenAI 2021)
Decoding: greedy (temperature=0), max_gen_toks=1024
Mode: raw completion (no --apply_chat_template), thinking off
Stop sequences: lm-eval default ('\nclass', '\ndef', '\n#', '\nif', '\nprint')
Inference engines: vLLM 0.6+ for bf16, upstream llama-server (b9020-era, CUDA build) for GGUFs
Tokenizer: shared (/workspace/REAM-192-bf16 HF dir) across all GGUF runs to ensure consistent tokenization client-side
Hardware: RunPod A100 SXM (80 GB VRAM), one model at a time

Why these numbers don't match published Qwen3.6 HumanEval (~85%)

No chat template applied. Qwen3.6 is instruct-tuned and benefits substantially from its chat formatting (typically +5-10 pp). We dropped --apply_chat_template because lm-eval 0.4.x's local-completions adapter has a known bug with chat-formatted generate_until requests (sends malformed prompt array → server returns 400). A chat-template-aware variant of this eval is on the follow-up list.
Thinking off. Qwen's headline 85% number assumes CoT/thinking enabled. Disabling thinking for this eval costs ~10-15 pp.
Default lm-eval task config. The humaneval task's stop sequences and prompt format are designed for base models doing raw code completion. Instruct models like Qwen3.6 lose points by emitting natural-language preamble before the function body that doesn't match these stops.
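
To illustrate the last point, here is a minimal sketch of stop-sequence truncation using the stop list from the methodology section. The two completions are hypothetical examples written for this illustration, not model output from the runs, and the helper is not lm-eval's own code.

```python
# lm-eval's default humaneval stop sequences, as listed in the methodology
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_at_stop(completion, stops=STOP_SEQUENCES):
    """Cut the completion at the earliest occurrence of any stop sequence."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Base-model style: the continuation is the function body itself.
raw_style = "    return a + b\n\ndef unrelated():\n    pass\n"
# Instruct style (hypothetical): prose first, then a re-declared function.
chat_style = "Here is the implementation:\n\ndef add(a, b):\n    return a + b\n"

print(repr(truncate_at_stop(raw_style)))   # body survives, trailing function is cut
print(repr(truncate_at_stop(chat_style)))  # only the prose preamble survives
```

In the second case the `\ndef` stop fires before any code is emitted, so the scored completion contains no function body even though the model "knew" the answer.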

Caveats — don't over-claim from this single benchmark

One narrow code metric. HumanEval pass@1 is one of many code-quality benchmarks. Broader coverage (BigCodeBench, LiveCodeBench, real agentic tasks like Aider) may show different patterns.
Small sample size. 164 problems means standard error is ±3.5-3.8 pp. Only the bf16 vs Unsloth UD-Q4_K_M gap clears the ±1 SE bands; every other pair of rows overlaps.
Single-run, no multi-seed variance. With greedy decoding the run is deterministic, but minor prompt-formatting differences (max length, BOS handling, stop sequences) between engines can introduce noise we haven't measured.
Methodology biases against published numbers. This eval doesn't measure "how good is REAM in real chat usage" — it measures "did the GGUF quants preserve the bf16's behavior in raw-completion mode." Different question, different answer.
Untested in other tasks. No MMLU yet (deferred — needs num_concurrent=8 to be tractable on lm-eval's local-completions). No GSM8K (CoT generation slow). No comparison vs unmerged Qwen3.6 BF16 (separate, larger eval).
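
As a rough way to read the gaps against the sample-size caveat, the sketch below expresses a score difference in units of the combined standard error. It treats the two runs as independent samples, which is a simplifying assumption: every model answered the same 164 problems, so a paired test on per-problem results would be more appropriate if those are extracted from the raw output.

```python
import math

def z_score(p1, se1, p2, se2):
    """Gap between two scores in units of the combined standard error."""
    return (p1 - p2) / math.sqrt(se1**2 + se2**2)

# (pass@1, SE) pairs copied from the results table
BF16 = (0.7134, 0.0354)
Q4_K_M = (0.6768, 0.0366)
UNSLOTH_UD = (0.6280, 0.0379)

print(round(z_score(*BF16, *Q4_K_M), 2))      # bf16 vs REAM Q4_K_M
print(round(z_score(*BF16, *UNSLOTH_UD), 2))  # bf16 vs Unsloth UD-Q4_K_M
```

Both values come out below 2, the usual rule-of-thumb threshold, which matches the "directional only" reading given above.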

What this is good for

Choosing between REAM Q4_K_M and Q3_K_S for your own setup: they're equivalent on the HumanEval pass@1 metric, so pick by disk/memory headroom, with one caveat from local agentic-bench observations. On our lil-quick agent-loop bench (2 standalone code-gen tasks against lil/pi with the daily-driver llama-server stack), Q3_K_S took ~26% longer than Q4_K_M with thinking enabled (~1810s vs ~1440s total) and produced visibly less thorough project scaffolding: Q4_K_M tended to set up a full TypeScript project (package.json + tsconfig + vitest config + impl + tests, 6 files), while Q3_K_S more often produced just impl + tests (2-4 files) and skipped the surrounding infra. The code itself was usually correct in both cases, but Q3 used more thinking iterations to arrive there. So:
- Q4_K_M if you can spare the 5 GB and want fuller, faster agent loops.
- Q3_K_S if you're memory-constrained and don't mind slightly tighter outputs / longer think loops on agentic tasks. Single-shot completion (HumanEval-style "fill in the function") is unaffected.
Sanity-checking that the REAM merge didn't damage code reasoning at a level a code agent would notice: it didn't.
Comparing quant pipelines (REAM merge → vanilla Q4_K_M vs unmerged base → Unsloth UD-Q4): vanilla Q4 of merged scored higher than UD-Q4 of unmerged here. Surprising; deserves more eval coverage.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf` | Q4_K_M | 16 GB | Balanced quality/size; recommended starting point |
| `Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf` | Q3_K_S | 11 GB (3.52 BPW) | Smaller-disk fallback; ~22% of bf16 reference |

The bf16 intermediate (~50 GB) is intentionally not uploaded; it can be regenerated at any time from the source repo with convert_hf_to_gguf.py --outtype bf16. See "Reproducibility note" below.
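
The "~22% of bf16 reference" figure for Q3_K_S follows from the bits-per-weight ratio. A small sketch of that arithmetic, assuming BPW is an average over all 27.05B parameters and ignoring file metadata (so the size estimate is approximate, and it lines up with the listed 11 GB only if that figure is read in GiB, which is an assumption):

```python
BF16_BITS = 16      # bits per weight in the bf16 reference
Q3_K_S_BPW = 3.52   # average bits per weight reported in the Files table
PARAMS = 27.05e9    # parameter count of the REAM-192 model

def size_ratio(bpw, reference_bits=BF16_BITS):
    """Fraction of the reference size taken by a quant with the given BPW."""
    return bpw / reference_bits

def approx_size_gib(bpw, params=PARAMS):
    """Rough file size in GiB; ignores metadata and per-tensor overrides."""
    return params * bpw / 8 / 2**30

print(f"ratio vs bf16: {size_ratio(Q3_K_S_BPW):.0%}")
print(f"approx size:   {approx_size_gib(Q3_K_S_BPW):.1f} GiB")
```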

Usage

`llama-server` (recommended)

./llama-server \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    -c 65536 \
    --jinja \
    --host 127.0.0.1 --port 8080

OpenAI-compatible chat completions on http://127.0.0.1:8080/v1/chat/completions. To disable Qwen3.6's <think> blocks, pass chat_template_kwargs: {"enable_thinking": false} in the request body.
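
A minimal standard-library client for the endpoint above, shown as a sketch: it sends one user message with thinking disabled through chat_template_kwargs. The helper names are illustrative, and the response parsing assumes the usual OpenAI-style choices[0].message.content layout.

```python
import json
import urllib.request

# Endpoint from the llama-server command above
URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_request(prompt, enable_thinking=False, url=URL):
    """Build a chat-completions POST with thinking toggled via chat_template_kwargs."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, enable_thinking=False):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_request(prompt, enable_thinking)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(ask("What is the capital of France?"))
```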

One-shot via `llama-completion`

./llama-completion \
    -m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":false}' \
    -p "Your prompt here" \
    -n 500 --temp 0.6

Vision (optional, drop-in)

This is a text-only release. The REAM-merged language model is architecturally identical (same hidden dim, same MoE layout) to the REAP-26B variant, so the existing atbender-derived mmproj from keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (mmproj-REAP-26B-F16.gguf) is a drop-in vision projector. Pair it via --mmproj for vision-enabled inference.

Method

Source: bf16 safetensors at keithnull/Qwen3.6-35B-A3B-REAM-192 (52 GB on HF; 27.05B params; vision tower + MTP layer preserved unchanged from upstream Qwen3.6-35B-A3B).
Convert to GGUF with convert_hf_to_gguf.py --outtype bf16 — never --outtype f16, because Qwen3.6's GDN (Gated Delta Network) linear-attention layers overflow fp16 silently and propagate NaN through any downstream quants.
Quantize via llama-quantize to Q4_K_M and Q3_K_S using llama.cpp upstream HEAD (~2026-05-04).
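
The fp16 warning above comes down to dynamic range: IEEE half precision tops out at 65504, while bf16 keeps float32's exponent range at reduced precision. The sketch below only illustrates that difference between the two formats with an arbitrary value; it does not inspect the model's GDN tensors.

```python
import struct

def fits_fp16(x):
    """True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

def to_bf16(x):
    """Round-trip x through bf16 by keeping the top 16 bits of its float32
    encoding (simple truncation, no rounding)."""
    top16 = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top16 + b"\x00\x00")[0]

print(fits_fp16(65504.0))  # largest finite fp16 value
print(fits_fp16(70000.0))  # just past the fp16 range
print(to_bf16(70000.0))    # bf16 stays finite, at coarser precision
```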

Reproducibility note

convert_hf_to_gguf.py from current upstream HEAD raises NotImplementedError("BPE pre-tokenizer was not recognized") on this model because the REAM-merged tokenizer's chkhsh (1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f) isn't yet registered. The fix is a single hash entry that maps it to res = "qwen2" — the Qwen2 / 3 / 3.5 / 3.6 family all share the same BPE pre-tokenization regex; only the vocab (and therefore the chkhsh) differ. The tokenizer.json round-trip during REAM's save_pretrained slightly re-orders fields, producing a new fingerprint despite the tokenizer being functionally identical to the upstream Qwen3.6-35B-A3B tokenizer.

To reproduce locally, before the convert call:

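from pathlib import Path

# Path to the upstream converter; adjust to your checkout
CONVERT_SCRIPT = Path("llama.cpp/convert_hf_to_gguf.py")
HASH = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"

src = CONVERT_SCRIPT.read_text()
if HASH not in src:  # idempotent: skip if the script is already patched
    # Locate the first `if chkhsh ==` check inside get_vocab_base_pre()
    i = src.index("def get_vocab_base_pre(")
    j = src.index("if chkhsh ==", i)
    line_start = src.rfind("\n", 0, j) + 1
    indent = src[line_start:j]  # reuse that line's indentation
    patch = (
        f'{indent}if chkhsh == "{HASH}":\n'
        f'{indent}    # Qwen3.6 (REAM-merged): Qwen2-family BPE pre-tokenization\n'
        f'{indent}    res = "qwen2"\n'
    )
    # Insert the new check just before the first existing one
    CONVERT_SCRIPT.write_text(src[:line_start] + patch + src[line_start:])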

GGUF

Model size: 27B params
Architecture: qwen35moe
Quantizations: 3-bit, 4-bit

Model tree for keithnull/Qwen3.6-35B-A3B-REAM-192-GGUF

Base model: Qwen/Qwen3.6-35B-A3B
Finetuned: keithnull/Qwen3.6-35B-A3B-REAM-192
Quantized: this model