Instructions to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m", filename="coe-svdv2-physics-bb128-r31-q4_K_M.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M # Run inference directly in the terminal: llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M # Run inference directly in the terminal: llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Use Docker
docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
- Ollama
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Ollama:
ollama run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
- Unsloth Studio
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting
- Pi
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Docker Model Runner:
docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
- Lemonade
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
Run and chat with the model
lemonade run user.coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m-Q4_K_M
List all available models
lemonade list
Run and chat with the model
lemonade run user.coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m-Q4_K_MList all available models
lemonade list- coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m — College of Experts Specialist
coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m — College of Experts Specialist
This is the MMLU-Pro Physics specialist in the 7-model Qwen3.6 College of Experts release (mmlu_pro corpus, r=31 R-list admixture). It is one of: physics, chemistry, math, law, engineering, coding_python, coding_web. See the FAMILY_README for full context, the cross-domain r-percent findings, and the construction pipeline.
Base model: Qwen/Qwen3.6-35B-A3B (Apache 2.0) Architecture: 40 layers × 256 routed experts/layer + 1 shared expert/layer, TOP_K=8 routed + 1 shared per token CoE construction: prune each MoE layer from 256 → 128 routed experts (50% K-budget reduction), with optional R-list admixture at fraction r ∈ {0.25, 0.31} Quantization: Q4_K_M
⚠️ Beta Release — Safety Disclaimer
These models are beta releases and should be treated as research artifacts, not production-ready systems.
Expert surgery selects and retains domain-relevant experts based on activation patterns observed during profiling. The pruning pipeline is designed solely to create a coherent domain specialist — it has no mechanism to identify which experts contribute to model alignment, ethical reasoning, or safety guardrails. As a result, experts responsible for enforcing those behaviours may have been inadvertently removed during the surgery process.
Appropriate use of any model in the College of Experts family is the sole
responsibility of the end user. The authors make no representation that these
models retain the safety properties of the parent Qwen/Qwen3.6-35B-A3B model,
and users should not rely on them as a substitute for models that have
undergone safety evaluation.
⚠️ Critical Usage Note — Think-Off Mode
All models in this series must be used in thinking-off mode.
If you are using the Ollama API, pass "think": false in your request body. If
you are accessing the model via a raw API (llama.cpp server, OpenAI-compatible
endpoint, etc.) you must inject a closed thinking block at the start of the
assistant turn:
messages = [
{"role": "system", "content": "Your system prompt here."},
{"role": "user", "content": "Your question here."},
{"role": "assistant", "content": "<think></think>\n"}, # <-- required prefill
]
Why this is required: expert surgery retains 50% of the routed expert pool
per layer, selecting experts that are maximally active on domain content and
chain-of-thought reasoning. A side effect is that the loop-suppression
experts — which activate on metacognitive closure signals near the end of a
<think> block — do not have a concentrated domain-specific activation
signature and are disproportionately pruned. In think-on mode, this causes the
model to enter a reasoning loop that exhausts the token budget without
producing a final answer. In extreme cases, the loop rate is 60–70% on hard
questions.
The <think></think> prefill works by consuming the opening <think> token
before generation starts, so the model sees its thinking as already complete
and proceeds directly to answering. This is the mechanism used in all
benchmarks reported here.
What think-off mode does not disable: Qwen3.6's chain-of-thought training is deeply ingrained. Even with the think block closed, the model produces brief inline reasoning interleaved with its answer — shorter and more linear than a full scratchpad, but present. All benchmark figures in this README are measured in this constrained-implicit-CoT mode.
Ollama Modelfile Template
{domain} is a placeholder. Replace it with the model's domain (e.g. physics, law, python coding) before creating the model.
FROM <model_path_or_ollama_tag>
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 8192
PARAMETER num_predict 8192
PARAMETER think false
SYSTEM """
You are a {domain} expert assistant. Answer the user's question.
"""
Temperature 0.6 is strongly recommended. Higher temperatures (≥ 0.8) materially increase loop rates in think-off mode.
What Are These Models?
These models are produced by activation-directed expert surgery applied to the Qwen3.6-35B-A3B base. The surgery does not change any weight values — it prunes the routed-expert FFN weight tensors that are not part of the domain-specialist mask, then saves the result as a smaller GGUF. No post surgery fine tuning or training was done. No specific effort was taken to either preserve or remove the vision/image input capabilities native to the parent model, but cursory testing confirms image input capability does remain but this has not been tested to establish the extent of retained vision ability.
For each release candidate, the mask is built in three stages, each derived from a distinct data signal collected in two separate forward passes through the parent model.
Stage 1 — Per-entry fingerprints and the K95 representation space.
A per-domain corpus is profiled by running each entry through the parent
model with router hooks attached to every MoE layer. For each token, the
router selects the top-8 experts and assigns softmax weights; the per-entry
fingerprint is the per-layer activation vector (40, 256), sum-normalized
per entry so that long entries do not dominate the aggregate. All 26 v2
domains' fingerprints are pooled and decomposed by uncentered SVD to produce
the representation space combined_v2_basis_svd_raw_norm.pt at K95
retention (40 layers, K95 dims per layer, total 2,956 dimensions; per-layer
range 41–113, mean 74). This is the space in which each expert has a unique
point and in which the domain centroid lives.
Stage 2 — K=128 cosine mask and the backbone.
For each candidate model domain, the entries that make up that domain's
corpus are pooled (a working group like physical_sciences pools the
chemistry + physics + engineering entries; the per-source centroids of an
abandoned combined model would differ from the per-source centroids of the
kept separate models, which is why the combined-domain attempt was
abandoned). The per-layer domain centroid μ is the entry-mean of the pooled
fingerprints. Projecting μ and each expert's column of Vt[:K95] into the
K95-SVD space gives a 256-element ranking of all experts per layer by
cosine similarity of the expert's direction to the centroid's direction
— i.e., semantic closeness to the domain. The top-128 by this cosine
ranking per layer is the K=128 cosine mask for that domain.
The backbone is the per-layer intersection of the K=128 cosine masks across all 21 (v1) or 26 (v2) domains — the experts that land in the top-128 of every domain. These are the universally co-activated experts and are locked during the R-list walk so they cannot be expelled by the admixture.
Stage 3 — R-list admixture (the r-percent step).
Independently of Stage 1, a second forward pass through the parent model
collects the per-domain per-token 3D histogram (40, 256, 8) of router
selections and softmax-weight sums. From this we compute, for each
(layer, expert):
R(l, e) = wm0 = weight_sum[l, e, 0] / max(hist[l, e, 0], 1)
with R(l, e) = 0 if hist[l, e, 0] < 5 (the min_count=5 filter
suppresses experts that have been the top-1 pick fewer than 5 times).
R is a mean softmax weight — a router-commitment signal: how
strongly the top-8 router commits to expert e on the occasions it
picks e as the top-1. Specialists (rare-but-decisive picks with high
softmax mass) outrank workhorses (frequent low-mass picks) on this
signal even when the latter have higher raw selection counts.
The R-list is the experts with the highest R, walked in descending
order. Starting from the K=128 cosine mask, the top r-budget R-list
experts (not already in the mask and not in the backbone) are injected
by evicting the lowest-cosine non-backbone experts one-for-one. The
result is a (40, 128) int16 mask file.
Note on data sources. The fingerprints (Stage 1 input) and the 3D histograms (Stage 3 input) come from two independent forward passes through the parent model. The fingerprints are not derivable from the histograms: the fingerprints are per-entry sum-normalized and preserve the per-entry routing pattern, while the histograms are per-domain per-token aggregates that discard entry boundaries. Each data source carries information the other does not.
A GGUF is built by slicing the parent GGUF's routed-expert weight tensors to keep only the K=128 selected experts per layer. Attention, router, shared expert, embeddings, output head, and norms are all preserved at full count. The result is a smaller GGUF with the same per-token activation count as the parent (TOP_K=8 routed + 1 shared = 9 experts fire per token, regardless of pool size).
Memory Efficiency
The parent GGUF is 24 GB at q4_K_M; each specialist is 13 GB at q4_K_M — a 46% disk-footprint reduction. Using the same 13/24 disk-footprint ratio to scale the parent's 35B total parameter count gives the specialists 19B total parameters, while the active parameter count stays at 3B (TOP_K=8 + 1 shared, unchanged from the parent). Throughput (tokens/second) is identical between the specialist and the parent at the same quantization because the number of expert weight tensors that participate in each forward pass is the same. The saving is purely in VRAM residency — half the routed expert weight tensors simply do not need to be loaded.
| Disk (q4_K_M) | Total params | Active params | Reduction vs parent | |
|---|---|---|---|---|
Parent Qwen3.6-35B-A3B |
24 GB | 35B | 3B | — |
| All CoE release candidates | 13 GB | 19B | 3B | 46% |
With adequate but modest context of less than 8k tokens this is runnable on 16GB VRAM excluding any extrinsic VRAM loads.
All figures directly measured in Ollama.
Release Set (2026-06-16)
Seven domain-specialist CoE models derived from Qwen/Qwen3.6-35B-A3B. All non-coding
domains are benchmarked on MMLU-Pro off1 (the standard held-out split, approximately
100 QIDs per domain depending on the granular subject). The python and web coding models are
benchmarkd using HumanEval and a custom web coding bench respectively. The CoE result is compared
against the parent on the same QID set.
| Domain | CoE (HF repo) | CoE (ollama tag) | K | r | GGUF | n | CoE acc% | Parent acc% ‡ | Gap |
|---|---|---|---|---|---|---|---|---|---|
| physics | coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m |
coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b:q4_k_m |
128 | 31 | 12.28 GB | 130 | 86.92% | 86.92% | +0.00 |
| chemistry | coe-qwen3.6-mmlu_pro-chemistry-r31-19b-a3b-q4_k_m |
coe-qwen3.6-mmlu_pro-chemistry-r31-19b-a3b:q4_k_m |
128 | 31 | 12.5 GB | 114 | 89.47% | 78.95% / 85.09% | +10.52 |
| math | coe-qwen3.6-mmlu_pro-math-r31-19b-a3b-q4_k_m |
coe-qwen3.6-mmlu_pro-math-r31-19b-a3b:q4_k_m |
128 | 31 | 13.19 GB | 135 | 88.15% | 91.85% | −3.70 |
| law | coe-qwen3.6-mmlu_pro-law-r25-19b-a3b-q4_k_m |
coe-qwen3.6-mmlu_pro-law-r25-19b-a3b:q4_k_m |
128 | 25 | 13.19 GB | 110 | 67.27% | 68.18% | −0.91 |
| engineering | coe-qwen3.6-mmlu_pro-engineering-r25-19b-a3b-q4_k_m |
coe-qwen3.6-mmlu_pro-engineering-r25-19b-a3b:q4_k_m |
128 | 25 | 13.19 GB | 97 | 74.23% | 64.95% / 81.44% | +9.30 |
| coding (HumanEval) | coe-qwen3.6-hc-coding_python-r25-19b-a3b-q4_k_m |
coe-qwen3.6-hc-coding_python-r25-19b-a3b:q4_k_m |
128 | 25 | 12.28 GB | 164 | 88.11% (2-seed mean) | 90.55% (2-seed mean) | −2.44 |
| coding (web) | coe-qwen3.6-hc-coding_web-r25-19b-a3b-q4_k_m |
coe-qwen3.6-hc-coding_web-r25-19b-a3b:q4_k_m |
128 | 25 | 13 GB | 144 | 73.61% | 67.36% | +6.25 |
‡ Parent accuracy is reported at two inference settings for engineering and chemistry (the only two domains re-benched at full compute):
- 8k ctx / 8k predict / think-off (the canonical release harness — the first number shown, 64.95% / 78.95%) — what the CoE numbers are also measured against, so the Gap column is directly comparable.
- 32k ctx / 24k predict / think-on (the "max-compute" parent configuration — the second number shown, 81.44% / 85.09%) — the upper bound on parent accuracy at the most generous inference settings tested in this study (see
Parent at Full Computebelow).
On both engineering and chemistry the CoE wins on accuracy at the canonical harness; on engineering, the parent at full compute does close the gap and slightly exceed the r=25 CoE (81.44% vs 74.23%); on chemistry, the CoE r=31 holds the lead against the parent at any tested compute configuration (89.47% vs the parent's 85.09% full-compute number).
Two CoE models beat the parent at the canonical harness: chemistry (+10.52 pp) and engineering (+9.30 pp). The chemistry gain is the largest positive gap of any release domain. Three CoE models tie or slightly trail the parent at the canonical harness: physics (tie), law (−0.91 pp), math (−3.70 pp). Two coding-specialist variants show mixed results: coding_python is slightly behind the parent on HumanEval (−2.44 pp, 2-seed mean), but coding_web beats the parent by +6.25 pp on the Web Visual Generation Suite.
The two CoE models that beat the parent also have a key operational advantage: they are ~46% smaller on disk, and at 8k / think-off the CoE is faster than the parent at any tested configuration (the parent's full-24k think-on mode takes ~3× longer per QID than the CoE's 8k think-off). See Pass@k Protocol below for the practical implication when the user does not have a reference answer.
Naming Convention
Each release candidate is a separate HuggingFace repo. The naming convention
is coe-qwen3.6-{corpus}-{domain}-r{NN}-{size}-{active}-{quant}:
coe-qwen3.6-— College of Experts family, base model Qwen3.6{corpus}—mmlu_pro(profiled on MMLU-Pro stratified subsets) for the 5 academic domains, orhc(hand-curated corpus, e.g. HumanEval / LCB-derived for coding) for the 2 coding specialists{domain}— granular MMLU-Pro subject (orcoding_python/coding_webfor the coding specialists)r{NN}— R-list admixture fraction. The optimal r is a per-domain hyperparameter: r=31 for the dense-signal STEM domains (physics, chemistry, math), r=25 for the flatter-signal domains (law, engineering, and the two coding specialists). SeeCross-Domain r-Percent Findingsbelow for the empirical basis.{size}-{active}— total and active parameter counts in billions. All 7 release candidates are 19B total / 3B active (K=128 experts/layer out of 256, the standard 50% K-budget reduction; thebb128token used in earlier internal tags is from a deprecated experimental backbone-density sweep and is not part of the release naming convention).{quant}— q4_K_M quantization
The local ollama registry uses the same names with :q4_k_m instead of
-q4_k_m (ollama syntax), and the bb128 segment is preserved in the
internal ollama tag for traceability to the build script's CLI.
Per-Domain Benchmark Details
MMLU-Pro domains (5 release candidates)
Evaluation harness: an in-house MMLU-Pro pass@1 benchmark runner.
- Multi-model × multi-domain support
- Domain-appropriate system prompt (mathematician, lawyer, etc.)
- Single-attempt, no retry chain except on loop detection
- Loop detection: 3+ repeats triggers a single retry with a new seed
Settings: temperature 0.6, top_k 64, top_p 0.95, repeat_penalty 1.05,
num_ctx 8192, num_predict 8192, seed 42 (fixed), think-OFF (assistant prefill
<think></think>\n).
Coding domains (2 release candidates)
coding_python (HumanEval):
- 164 HumanEval problems, 2-seed mean accuracy (seeds 42 and 38762)
- Temperature 0.6, top_k 64, top_p 0.95, repeat_penalty 1.05
- CoE mean: 88.11% (88.41% seed=42, 87.80% seed=38762)
- Parent mean: 90.55% (90.24% seed=42, 90.85% seed=38762)
- Gap: −2.44 pp
coding_web (Web Visual Generation Suite):
- 144 prompts sampled from a 1-in-5 subset of the web coding corpus
- DOM rendering + visual inspection + functional assertions
- Dual-pass with live streaming, loop trap detection, multi-turn seed fallback
- Settings: as for coding_python above
- CoE: 73.61% (6 loop traps, 100% recovery rate via loop guard)
- Parent: 67.36% (0 loop traps)
- Gap: +6.25 pp
Cross-Domain r-Percent Findings
The r-percent R-list admixture fraction is a per-domain hyperparameter. Across the 7 release domains, the optimal r is consistently in the 25 to 31 percent range but the peak is definite but not particularly sharp. The grouping below reflects the empirical pattern observed in the per-domain r-sweeps rather than an a-priori taxonomy.
| Domain category | Preferred r | Empirical pattern |
|---|---|---|
| physics, chemistry, math | r=31 | dense R-signal: high-wm0 specialists dominate; more R-list admixture is monotonically better through r=31 |
| engineering | r=25 | medium R-signal; r=25 > r=31 on full 97 QIDs (engineering is a STEM domain in MMLU-Pro taxonomy but R-signal density tracks the per-domain corpus, not the field label) |
| law | r=25 | flatter R-signal; r=25 is a local maximum, r=22 ties r=0 cosine, r=19 falls below r=0 |
| python, web | r=25 | flatter R-signal in the LCB-derived python corpus; r=25 > r=28 on full 164 HumanEval tasks |
Parent at Full Compute
To verify that the engineering r=25 win and the chemistry r=31 lead are robust — and not artefacts of the canonical 8k-ctx think-off harness handicapping the parent — the parent was re-benched on engineering off1 (97 QIDs) and chemistry off1 (114 QIDs) at the most generous inference settings tested in this study:
| Knob | Value |
|---|---|
num_ctx |
32768 (32k) |
num_predict |
24576 (24k) |
think |
true (native thinking) |
| assistant prefill | none (think block open) |
| temperature, top_k, top_p, repeat_penalty, seed | same as canonical (0.6, 64, 0.95, 1.05, 42) |
The run was a post-import monkey-patch of the standard bench runner.
Results (single-shot, no pass@k)
| Run | n | corr | noans | wrong | acc% | wall time | mean sec/q |
|---|---|---|---|---|---|---|---|
| Parent eng 32k ctx / 24k predict / think-on | 97 | 79 | 6 | 12 | 81.44% | 255 min | 157.7 s |
| Parent chem 32k ctx / 24k predict / think-on | 114 | 97 | 13 | 4 | 85.09% | 327 min | 172.1 s |
Both numbers are substantially higher than the parent's canonical 8k-ctx / 8k-predict / think-off numbers (engineering: +16.49 pp, chemistry: +6.14 pp). The previous lower results in the literature on the parent model were a harness artifact — at the canonical 8k / 8k / think-off config, the parent's think block is also clipped at 8k tokens, and that clipping produces no-answers on hard questions that the model would have answered with more room.
Cross-domain comparison: CoE release vs Parent at full compute
| Domain | CoE (8k think-off) | Parent 8k think-off | Parent 32k/24k think-on | Best |
|---|---|---|---|---|
| engineering | 74.23% | 64.95% | 81.44% | Parent (full compute) by +7.21 pp |
| chemistry | 89.47% | 78.95% | 85.09% | CoE r=31 by +4.38 pp |
The CoE r=25 engineering release does not catch the parent at full compute; on chemistry, the CoE r=31 release still beats the parent at full compute by +4.38 pp (89.47% vs 85.09%). The CoE is also dramatically cheaper at inference — the parent's full-compute run takes 3.1× longer per QID than the CoE's 8k think-off run on engineering (158s vs 51s) and 2.5× longer on chemistry (172s vs 70s).
Practical reading. The CoE is the right "production default" on chemistry at 8k / think-off: 89.47% accuracy, 70 sec/QID, 13 GB. The parent at full compute is the right "maximum accuracy" mode when wall time is not the binding constraint — useful for benchmark reproduction or one-off deep analyses. The two are not in competition: they serve different operating points on the latency/accuracy trade-off.
Why the canonical 8k harness was handicapping the parent
The parent's 8k / 8k / think-off run produced 16 no-answers on chemistry
(78.95% accuracy) and 27 no-answers on engineering (64.95% accuracy). A
targeted rerun on the 16 chemistry no-answer QIDs at 24k predict recovered
11 of them, but did not retroactively change the 78.95% — those recovered
answers were already counted in the 90 / 114 = 78.95% via the cumulative
additive interpretation (see the ‡ footnote on the release table). The
full-domain rerun at 32k / 24k / think-on confirms this: 97/114 = 85.09%
on chemistry, +6.14 pp over canonical, with 13 no-answers instead of 16.
The bound on the parent at any tested compute configuration is therefore ~85% on chemistry, ~81% on engineering. The CoE at 8k / think-off matches or exceeds this on chemistry (89.47%) and trails on engineering (74.23% vs 81.44%).
Pass@k Protocol
For the engineering r=25 release, a pass@k sweep was run on the same 97
QIDs with the canonical 8k / think-off harness, retrying failed QIDs with
fresh seeds from the schedule [42, 7, 13, 99, 17, 31, 57, 89, 123, 251].
Each QID is re-rolled only on failure (i.e. the runner exits early on the
first correct attempt — the protocol the runner was designed for).
The pass@1 row of this table is the canonical engineering pass@1 number (72/97 = 74.23%) and is reused from the prior single-shot run; the runner only does the rounds 2..k for the unsolved QIDs. This avoids wasting ~99 min of compute re-running already-solved QIDs at attempt 1. Total wall-time budget: 255 min (matching the parent's full-compute single attempt engineering run, for like-for-like comparison).
Pass@k curve (corrected protocol — round 1 reused from prior run)
| k | n_run | solved / n | pass@k | Δ from pass@(k-1) | round wall (min) | cum wall (min) |
|---|---|---|---|---|---|---|
| 1 | 97 | 72/97 | 74.23% | — | 104.2 | 104.2 |
| 2 | 25 | 79/97 | 81.44% | +7.21 pp | 39.7 | 143.9 |
| 3 | 18 | 83/97 | 85.57% | +4.13 pp | 27.6 | 171.6 |
| 4 | 14 | 86/97 | 88.66% | +3.09 pp | 22.8 | 194.4 |
| 5 | 11 | 88/97 | 90.72% | +2.06 pp | 21.2 | 215.5 |
| 6 | 9 | 90/97 | 92.78% | +2.06 pp | 16.8 | 232.3 |
| 7 | 6 | 91/97 | 93.81% | +1.03 pp | 10.5 | 242.9 |
| 8-10 | (budget exhausted before next round) | 91/97 | 93.81% | 0 | — | — |
Time to reach pass@2: 143.9 min total (104.2 min for round 1 + 39.7 min for round 2 on the 25 unsolved QIDs). Average wall time per QID at pass@2: 89.0 s.
Time to reach plateau (pass@7 = pass@10 = 91/97): 242.9 min total (round 7 added 1 more QID at 10.5 min; rounds 8+ were cut by the 255 min budget). Beyond pass@7, no further QIDs are solvable on the remaining budget.
The 6 of the 97 QIDs that are "sticky" (never solved across all 1–7 attempts they were given) are: 11306, 11397, 11417, 11788, 11818, 12039. Of these, 11306, 11397, 11788 returned no-answer on most seeds; the other 3 produced wrong answers consistently.
Operating-point guidance for r=25. The cost-per-pp starts at
5.5 min/pp (pass@1→2) and roughly doubles by pass@5 (10.3 min/pp).
For a single-shot answer, pass@1 = 74.23% is the baseline. For a
self-consistent 2-of-3 protocol, pass@2 = 81.44% at 39.7 min of
extra cost is the recommended operating point — the marginal cost
between pass@2 and pass@7 is 99.0 min for +12.37 pp, which is rarely
worth it for production traffic. The 99.0-min / 12.37-pp ratio
(8.0 min/pp) is the upper-bound cost of "squeeze the last accuracy
out of pass@k."
CoE r=25 + pass@k vs Parent single-shot (engineering off1, 97 QIDs)
| Configuration | Time/QID | Accuracy | acc/min |
|---|---|---|---|
| CoE r=25 single-shot (8k / think-off) — pass@1 | 64.5 s | 74.23% | 1.151 |
| CoE r=25 + pass@2 | 89.0 s | 81.44% | 0.549 |
| CoE r=25 + pass@7 plateau | 160.6 s | 93.81% | 0.386 |
| Parent full-24k (32k / 24k / think-on) | 158 s | 81.44% | 0.310 |
(Time/QID for the CoE rows is the per-QID mean across all attempted runs, including sticky QIDs that consumed the full attempt budget.)
The CoE r=25 with pass@2 hits 81.44% at 89.0 s/QID — tying the parent's full-compute 81.44% at 158 s/QID at 0.56× the wall time. The plateau number 93.81% is the upper bound on what the r=25 model can achieve with unlimited retries (242.9 min total); the practical operating point is 81.44% at pass@2.
For real-world deployment without a reference answer, a "2-of-3 self-consistent" protocol on the CoE r=25 (run twice; if the two answers disagree, tag as "low confidence" and either escalate or re-ask) gives ~91% confidence on the served answer at 89.0 s/QID (the per-QID cost at pass@2 — the median of the two runs). See the CoE agent cascade writeup for the full privacy-and-cost analysis.
CoE Agent: Cascade + Privacy
The CoE agent framework (updated version planned for release) wraps the CoE models in a T0/T1/T2 cascade for real-world deployment:
- T0 — Local CoE r=25 first attempt (60s, 78% accurate, free, no network)
- T1 — Local retry with parser-aware hint (110s, +6% marginal accuracy, free, no network)
- T2 — Online SOTA escalation with cover-traffic padding (1 real + N-1 decoys per escalation) and SOTA model rotation (round-robin across Claude / GPT / Gemini/ GLM/ other SOTA tier)
The cascade numbers for engineering off1: 84.5% stay purely local, 15.5% escalate. The cost of the cascade at standard 4× padding + 3-way rotation is $1.86 per 100 QIDs at 1.29% effective privacy leakage — vs $30+ per 100 QIDs at 100% leakage for the all-online alternative using approximate API fees for proprietary models using the next to the latest tier of proprietary models.(ie 2 million output tokens at $15 per million)
Construction Pipeline
For each release candidate, the build follows the pipeline below. Steps 1 and 2 are two separate forward passes through the parent model — they are not collected simultaneously and produce distinct data products that are not derivable from one another:
- Forward pass #1 — profile histograms. The domain corpus is run through
the parent model with router hooks on every MoE layer. The result is a
per-domain 3D activation histogram of shape
(40, 256, 8)over (layer, expert, rank), plus a matching(40, 256, 8)tensor of softmax-weight sums. This is the only input to the R-list (Step 4). - Forward pass #2 — entry fingerprints. All domains/categories are
profiled again, this time recording activation per entry (sum-normalized per entry per layer). This is not derivable from the histograms — the fingerprints preserve per-entry routing patterns that the per-domain per-token histogram aggregation discards. The fingerprints are pooled across all 26 v2 domains and decomposed by uncentered SVD to build the K95 representation space (
combined_v2_basis_svd_raw_norm.pt). This space is the input to the cosine K=128 mask (Step 3). - Build a K=128 cosine mask by SVDv2 cosine similarity to the domain
centroid in K95 space, intersected with a backbone of experts that
survive across all profiled domains. The cosine is computed between the
direction of the per-layer domain centroid (projected from the pooled
fingerprint entry-mean) and the direction of each expert's column of
Vt[:K95]; the top-128 by cosine per layer is the K=128 cosine mask. - Build an r-percent mask starting from the K=128 cosine mask, removing
the backbone to obtain a swap-eligible set, and admixing the top R-budget
experts (ranked by the
R/wm0signal — mean softmax weight at rank-0, with a min_count=5 filter to suppress unobserved experts) at fraction r ∈ {0.25, 0.31}. The result is a (40, 128) int16 mask file. - Build the GGUF by slicing the parent GGUF's routed-expert FFN weight tensors to keep only the K=128 selected experts per layer. Attention, router, shared expert, embeddings, output head, and norms are all preserved at full count. The metadata patch updates the expert count from 256 to 128.
- Benchmark the new model on MMLU-Pro off1 (or HumanEval / Web Visual for coding domains) at temperature 0.6, num_ctx 8192, num_predict 8192, think-OFF, seed 42.
- Validate the r-percent choice by sweeping r ∈ {0.25, 0.28, 0.31, 0.34} on the new domain; promote the best-performing variant to release.
The full technical documentation — including per-step recipes, pitfalls, and the empirical data behind the r-percent sweep — is planned for release in the near future at https://github.com/JThomas-CoE/College-of-Experts-AI
Citation / Attribution
Research and engineering by JThomas-CoE.
Base model: Qwen/Qwen3.6-35B-A3B by the Qwen team. All specialist weights are derived from the publicly released checkpoint. Usage of the base model is subject to the Apache License, Version 2.0.
The project also publishes a gemma4-based CoE series under the same general approach.
License
Model weights: Apache 2.0 (inherited from Qwen/Qwen3.6-35B-A3B).
Code and tooling: PolyForm Noncommercial 1.0.0.
Commercial licensing: see LICENSE-COMMERCIAL.md.
- Downloads last month
- 29
4-bit
Model tree for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m
Base model
Qwen/Qwen3.6-35B-A3B
Pull the model
# Download Lemonade from https://lemonade-server.ai/lemonade pull JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M