Qwen3.6-35B-A3B-REAM-192: GGUF quants
GGUF quantizations of keithnull/Qwen3.6-35B-A3B-REAM-192 — the REAM expert-merged variant of Qwen/Qwen3.6-35B-A3B (256 → 192 routed experts, 35.11B → 27.05B params, −23%).
Preliminary HumanEval comparison
⚠️ Read the caveats below before drawing conclusions. This is a relative quant-quality reference, not an absolute capability benchmark. Numbers do not directly compare to published Qwen3.6-35B-A3B HumanEval scores (which use chat template + thinking enabled).
Results (HumanEval pass@1, raw completion, greedy, thinking off)
| Model | Compression | Disk | pass@1 | ± SE | Notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-REAM-192 bf16 (reference) | 256→192 experts (REAM merge) | ~52 GB | 0.7134 | 0.0354 | Our bf16 source; reference for quant degradation |
| Qwen3.6-35B-A3B-REAM-192 Q4_K_M (this repo) | + Q4_K_M quant | 16 GB | 0.6768 | 0.0366 | Daily-driver candidate |
| Qwen3.6-35B-A3B-REAM-192 Q3_K_S (this repo) | + Q3_K_S quant | 11 GB | 0.6768 | 0.0366 | Same pass@1 as Q4_K_M; ~30% smaller |
| REAP-26B Q4_K_M (atbender prune, same 256→192 ratio) | 256→192 experts (REAP prune) | ~15 GB | 0.6646 | 0.0370 | Direct merge-vs-prune comparison at same compression |
| Bartowski Q4_K_M | None + Q4_K_M | ~17 GB | 0.6463 | 0.0374 | Reference for "vanilla quant of full unmerged model" |
| Unsloth Qwen3.6-35B-A3B UD-Q4_K_M | None + UD-Q4 | ~21 GB | 0.6280 | 0.0379 | Unsloth's dynamic-quant variant of the full unmerged model |
Raw lm-eval output (results JSON + run logs) for all six runs is available in eval-results-2026-05-05/.
Headline reads
- Q3_K_S = Q4_K_M on this benchmark. Both passed exactly 111/164 problems. At ~30% smaller disk footprint (11 GB vs 16 GB), Q3_K_S is a viable memory-constrained alternative for code work.
- Calibration data likely did real work. Both REAM and REAP used atbender's code-heavy recipe (SWE-smith, xLAM, evol-codealpaca, mix-of-thoughts). This plausibly concentrated code-relevant capability into the remaining 192 experts, partially offsetting the parameter-count loss. A non-code-calibrated REAM/REAP variant would likely score lower on HumanEval but better elsewhere.
- REAM Q4_K_M matches or exceeds the pass@1 of the unmerged base's Q4 quants at a smaller disk footprint: 16 GB vs ~17 GB (Bartowski) or ~21 GB (Unsloth UD). For memory-constrained users, REAM Q4 (or even Q3) is a quality-preserving size win on this benchmark.
Caveats: the ±1 SE intervals overlap substantially (REAM 64.0-71.3% vs REAP 62.8-70.2% vs Bartowski 60.9-68.4%), so all of the above is directional on this single benchmark. Ranking could shift on broader code benchmarks or with multi-seed sampling. For reference, bf16 → Q4_K_M loses ~3.7 pp; Q4_K_M → Q3_K_S loses 0 pp on this benchmark.
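The pass@1 values, standard errors, and intervals quoted here can be re-derived from pass counts. The sketch below assumes the table's ± SE is the sample standard error of a mean of 164 pass/fail outcomes (dividing by n - 1), which reproduces the listed 0.0366 for 111/164. The helper names are illustrative and not part of lm-eval.

```python
import math

N_PROBLEMS = 164  # HumanEval problem count


def pass_at_1_stats(passed: int, n: int = N_PROBLEMS) -> tuple[float, float]:
    """Return (pass@1, standard error) for `passed` successes out of `n`."""
    p = passed / n
    # Sample standard error of a mean of 0/1 outcomes (n - 1 in the denominator).
    # Assumption: this is the convention behind the table's "± SE" column.
    se = math.sqrt(p * (1.0 - p) / (n - 1))
    return p, se


def one_se_interval(passed: int, n: int = N_PROBLEMS) -> tuple[float, float]:
    """Return the (low, high) bounds of a +/- 1 SE band around pass@1."""
    p, se = pass_at_1_stats(passed, n)
    return p - se, p + se


if __name__ == "__main__":
    # Both REAM quants passed 111 of 164 problems.
    p, se = pass_at_1_stats(111)
    low, high = one_se_interval(111)
    print(f"pass@1={p:.4f} SE={se:.4f} band={low:.1%}..{high:.1%}")
```

A ±1 SE band is narrower than a 95% interval (roughly ±1.96 SE), so the overlap between models is even larger at the 95% level.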
Methodology (what's being measured)
- Task: lm-eval-harness `humaneval` (164 problems, OpenAI 2021)
- Decoding: greedy (`temperature=0`), `max_gen_toks=1024`
- Mode: raw completion (no `--apply_chat_template`), thinking off
- Stop sequences: lm-eval default (`'\nclass', '\ndef', '\n#', '\nif', '\nprint'`)
- Inference engines: vLLM 0.6+ for bf16, upstream `llama-server` (b9020-era, CUDA build) for GGUFs
- Tokenizer: shared (`/workspace/REAM-192-bf16` HF dir) across all GGUF runs to ensure consistent tokenization client-side
- Hardware: RunPod A100 SXM (80 GB VRAM), one model at a time
Why these numbers don't match published Qwen3.6 HumanEval (~85%)
- No chat template applied. Qwen3.6 is instruct-tuned and benefits substantially from its chat formatting (typically +5-10 pp). We dropped `--apply_chat_template` because lm-eval 0.4.x's `local-completions` adapter has a known bug with chat-formatted `generate_until` requests (sends malformed `prompt` array → server returns 400). A chat-template-aware variant of this eval is on the follow-up list.
- Thinking off. Qwen's headline 85% number assumes CoT/thinking enabled. Disabling thinking for this eval costs ~10-15 pp.
- Default lm-eval task config. The `humaneval` task's stop sequences and prompt format are designed for base models doing raw code completion. Instruct models like Qwen3.6 lose points by emitting natural-language preamble before the function body that doesn't match these stops.
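To see why a natural-language preamble is costly under these stops, the sketch below mimics stop-sequence truncation in plain Python. It illustrates the general mechanism only and is not lm-eval's actual implementation; `truncate_at_stop` is a hypothetical helper, and the stop list is copied from the methodology section.

```python
# Default stop sequences listed in the methodology section.
STOPS = ("\nclass", "\ndef", "\n#", "\nif", "\nprint")


def truncate_at_stop(text: str, stops: tuple[str, ...] = STOPS) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    # Base-model style: the completion is just the function body.
    raw = "    return a + b\n\ndef other():\n    pass"
    print(repr(truncate_at_stop(raw)))

    # Instruct style: a preamble followed by a fresh definition. The "\ndef"
    # stop fires before any code is captured, so the kept text has no body.
    chatty = "Here is the solution:\n\ndef add(a, b):\n    return a + b"
    print(repr(truncate_at_stop(chatty)))
```

In the second case the scored completion contains no function body at all, which is one way an otherwise correct answer can fail the test harness.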
Caveats — don't over-claim from this single benchmark
- One narrow code metric. HumanEval pass@1 is one of many code-quality benchmarks. Broader coverage (BigCodeBench, LiveCodeBench, real agentic tasks like Aider) may show different patterns.
- Small sample size. 164 problems means standard error is ±3.5-3.8 pp. Some inter-model gaps are barely outside the confidence-interval overlap.
- Single-run, no multi-seed variance. With greedy decoding the run is deterministic, but minor prompt-formatting differences (max length, BOS handling, stop sequences) between engines can introduce noise we haven't measured.
- Methodology biases against published numbers. This eval doesn't measure "how good is REAM in real chat usage" — it measures "did the GGUF quants preserve the bf16's behavior in raw-completion mode." Different question, different answer.
- Untested in other tasks. No MMLU yet (deferred; needs `num_concurrent=8` to be tractable on lm-eval's `local-completions`). No GSM8K (CoT generation slow). No comparison vs unmerged Qwen3.6 BF16 (separate, larger eval).
What this is good for
Choosing between REAM Q4_K_M and Q3_K_S for your own setup: they're equivalent on the HumanEval pass@1 metric, so pick by disk/memory headroom, with one caveat from local agentic-bench observations. On our `lil-quick` agent-loop bench (2 standalone code-gen tasks against `lil/pi` with the daily-driver `llama-server` stack), Q3_K_S took ~26% longer than Q4_K_M with thinking enabled (1810s vs ~1440s total) and produced visibly less thorough project scaffolding: Q4_K_M tended to set up a full TypeScript project (package.json + tsconfig + vitest config + impl + tests, 6 files), while Q3_K_S more often produced just impl + tests (2-4 files) and skipped the surrounding infra. The code itself was usually correct in both cases, but Q3 used more thinking iterations to arrive there. So:
- Q4_K_M if you can spare the 5 GB and want fuller, faster agent loops.
- Q3_K_S if you're memory-constrained and don't mind slightly tighter outputs / longer think loops on agentic tasks. Single-shot completion (HumanEval-style "fill in the function") is unaffected.
Sanity-checking that the REAM merge didn't damage code reasoning at a level a code agent would notice: it didn't.
Comparing quant pipelines (REAM merge → vanilla Q4_K_M vs unmerged base → Unsloth UD-Q4): vanilla Q4 of merged scored higher than UD-Q4 of unmerged here. Surprising; deserves more eval coverage.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| `Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf` | Q4_K_M | 16 GB | Balanced quality/size; recommended starting point |
| `Qwen3.6-35B-A3B-REAM-192-Q3_K_S.gguf` | Q3_K_S | 11 GB (3.52 BPW) | Smaller-disk fallback; ~22% of bf16 reference |
The bf16 intermediate (~50 GB) is intentionally not uploaded - it is recoverable any time from the source repo + convert_hf_to_gguf.py --outtype bf16. See "Reproducibility" below.
Usage
llama-server (recommended)
./llama-server \
-m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
-c 65536 \
--jinja \
--host 127.0.0.1 --port 8080
OpenAI-compatible chat completions on http://127.0.0.1:8080/v1/chat/completions. To disable Qwen3.6's <think> blocks, pass chat_template_kwargs: {"enable_thinking": false} in the request body.
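A minimal client sketch for the endpoint above, using only the Python standard library. It assumes the server from the previous command is running on 127.0.0.1:8080 and omits the `model` field on the assumption that the server is serving a single model; `build_payload` and `chat` are illustrative names.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt: str, enable_thinking: bool = False) -> dict:
    """Build the request body; chat_template_kwargs toggles <think> blocks."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, enable_thinking: bool = False, url: str = URL) -> str:
    """POST the payload and return the assistant's reply text."""
    data = json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running llama-server; raises URLError otherwise.
    print(chat("Write a Python function that reverses a string."))
```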
One-shot via llama-completion
./llama-completion \
-m Qwen3.6-35B-A3B-REAM-192-Q4_K_M.gguf \
--jinja \
--chat-template-kwargs '{"enable_thinking":false}' \
-p "Your prompt here" \
-n 500 --temp 0.6
Vision (optional, drop-in)
This is a text-only release. The REAM-merged language model is architecturally identical (same hidden dim, same MoE layout) to the REAP-26B variant, so the existing atbender-derived mmproj from keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF (mmproj-REAP-26B-F16.gguf) is a drop-in vision projector. Pair it via --mmproj for vision-enabled inference.
Method
- Source: bf16 safetensors at `keithnull/Qwen3.6-35B-A3B-REAM-192` (52 GB on HF; 27.05B params; vision tower + MTP layer preserved unchanged from upstream Qwen3.6-35B-A3B).
- Convert to GGUF with `convert_hf_to_gguf.py --outtype bf16`, never `--outtype f16`, because Qwen3.6's GDN (Gated Delta Network) linear-attention layers overflow fp16 silently and propagate NaN through any downstream quants.
- Quantize via `llama-quantize` to `Q4_K_M` and `Q3_K_S` using llama.cpp upstream HEAD (~2026-05-04).
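The convert-then-quantize steps can be sketched as a small command builder. The directory and output names below are placeholders rather than the paths used for this release, and the sketch assumes `convert_hf_to_gguf.py` sits in a local `llama.cpp` checkout with `llama-quantize` on `PATH`. It only prints the commands; run them yourself once the tokenizer-hash issue described in the reproducibility note is handled.

```python
import shlex


def build_commands(hf_dir, out_prefix, quants=("Q4_K_M", "Q3_K_S")):
    """Return the convert command followed by one quantize command per type."""
    bf16_gguf = f"{out_prefix}-bf16.gguf"
    commands = [
        # bf16, not f16: the source notes that fp16 overflows in the GDN layers.
        ["python", "llama.cpp/convert_hf_to_gguf.py", hf_dir,
         "--outfile", bf16_gguf, "--outtype", "bf16"],
    ]
    for quant in quants:
        # llama-quantize takes: <input.gguf> <output.gguf> <type>
        commands.append(
            ["llama-quantize", bf16_gguf, f"{out_prefix}-{quant}.gguf", quant]
        )
    return commands


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    for cmd in build_commands("./REAM-192-bf16", "Qwen3.6-35B-A3B-REAM-192"):
        print(shlex.join(cmd))
```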
Reproducibility note
convert_hf_to_gguf.py from current upstream HEAD raises NotImplementedError("BPE pre-tokenizer was not recognized") on this model because the REAM-merged tokenizer's chkhsh (1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f) isn't yet registered. The fix is a single hash entry that maps it to res = "qwen2" — the Qwen2 / 3 / 3.5 / 3.6 family all share the same BPE pre-tokenization regex; only the vocab (and therefore the chkhsh) differ. The tokenizer.json round-trip during REAM's save_pretrained slightly re-orders fields, producing a new fingerprint despite the tokenizer being functionally identical to the upstream Qwen3.6-35B-A3B tokenizer.
To reproduce locally, before the convert call:
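PATH_ = "llama.cpp/convert_hf_to_gguf.py"
HASH = "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"
with open(PATH_, encoding="utf-8") as f:
    src = f.read()
if HASH not in src:  # skip if the hash is already registered
    # Find the first `if chkhsh ==` check inside get_vocab_base_pre().
    i = src.index("def get_vocab_base_pre(")
    j = src.index("if chkhsh ==", i)
    ls = src.rfind("\n", 0, j) + 1  # start of that line
    ind = src[ls:j]  # its leading indentation
    patch = (
        f'{ind}if chkhsh == "{HASH}":\n'
        f'{ind}    # Qwen3.6 (REAM-merged): Qwen2-family BPE pre-tokenization\n'
        f'{ind}    res = "qwen2"\n'
    )
    # Insert the new check just before the existing one.
    with open(PATH_, "w", encoding="utf-8") as f:
        f.write(src[:ls] + patch + src[ls:])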