Instructions to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m",
	filename="coe-svdv2-physics-bb128-r31-q4_K_M.gguf",
)

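response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# The result is an OpenAI-style dict; print only the generated text.
print(response["choices"][0]["message"]["content"])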

Local Apps

llama.cpp

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Use Docker

docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M


vLLM

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Ollama
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Ollama:
```
ollama run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
```

Unsloth Studio

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m to start chatting

Pi

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Docker Model Runner:
```
docker model run hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M
```

Lemonade

How to use JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M

Run and chat with the model

lemonade run user.coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m-Q4_K_M

List all available models

lemonade list

coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m — College of Experts Specialist

This is the MMLU-Pro Physics specialist in the 7-model Qwen3.6 College of Experts release (mmlu_pro corpus, r=31 R-list admixture). It is one of: physics, chemistry, math, law, engineering, coding_python, coding_web. See the FAMILY_README for full context, the cross-domain r-percent findings, and the construction pipeline.

Base model: Qwen/Qwen3.6-35B-A3B (Apache 2.0)
Architecture: 40 layers × 256 routed experts/layer + 1 shared expert/layer, TOP_K=8 routed + 1 shared per token
CoE construction: prune each MoE layer from 256 → 128 routed experts (50% K-budget reduction), with optional R-list admixture at fraction r ∈ {0.25, 0.31}
Quantization: Q4_K_M

⚠️ Beta Release — Safety Disclaimer

These models are beta releases and should be treated as research artifacts, not production-ready systems.

Expert surgery selects and retains domain-relevant experts based on activation patterns observed during profiling. The pruning pipeline is designed solely to create a coherent domain specialist — it has no mechanism to identify which experts contribute to model alignment, ethical reasoning, or safety guardrails. As a result, experts responsible for enforcing those behaviours may have been inadvertently removed during the surgery process.

Appropriate use of any model in the College of Experts family is the sole responsibility of the end user. The authors make no representation that these models retain the safety properties of the parent Qwen/Qwen3.6-35B-A3B model, and users should not rely on them as a substitute for models that have undergone safety evaluation.

⚠️ Critical Usage Note — Think-Off Mode

All models in this series must be used in thinking-off mode.

If you are using the Ollama API, pass "think": false in your request body. If you are accessing the model via a raw API (llama.cpp server, OpenAI-compatible endpoint, etc.) you must inject a closed thinking block at the start of the assistant turn:

messages = [
    {"role": "system",    "content": "Your system prompt here."},
    {"role": "user",      "content": "Your question here."},
    {"role": "assistant", "content": "<think></think>\n"},   # <-- required prefill
]

Why this is required: expert surgery retains 50% of the routed expert pool per layer, selecting experts that are maximally active on domain content and chain-of-thought reasoning. A side effect is that the loop-suppression experts — which activate on metacognitive closure signals near the end of a <think> block — do not have a concentrated domain-specific activation signature and are disproportionately pruned. In think-on mode, this causes the model to enter a reasoning loop that exhausts the token budget without producing a final answer. In extreme cases, the loop rate is 60–70% on hard questions.

The <think></think> prefill works by consuming the opening <think> token before generation starts, so the model sees its thinking as already complete and proceeds directly to answering. This is the mechanism used in all benchmarks reported here.

What think-off mode does not disable: Qwen3.6's chain-of-thought training is deeply ingrained. Even with the think block closed, the model produces brief inline reasoning interleaved with its answer — shorter and more linear than a full scratchpad, but present. All benchmark figures in this README are measured in this constrained-implicit-CoT mode.
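The prefill described above can be sketched against an OpenAI-compatible endpoint using only the Python standard library. The helper names, the `model` placeholder, and the URL `http://localhost:8080/v1/chat/completions` (the port used in the llama.cpp server configuration earlier on this page) are assumptions for illustration. Whether a server continues generation from a trailing assistant message depends on the server and its version, so treat this as a starting point rather than a guaranteed recipe.

```python
import json
import urllib.request

THINK_OFF_PREFILL = "<think></think>\n"  # closed thinking block, as in the note above


def build_think_off_payload(system_prompt, question, model="local"):
    """Build a chat-completions body whose last message is the assistant prefill."""
    return {
        "model": model,  # hypothetical placeholder; adjust to what your server expects
        "temperature": 0.6,  # recommended temperature from this README
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": THINK_OFF_PREFILL},
        ],
    }


def post_chat(payload, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload and return the generated text (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_think_off_payload(
    "You are a physics expert assistant. Answer the user's question.",
    "What is the escape velocity at the Earth's surface?",
)
```

With a server running, `post_chat(payload)` would send the request; building the payload separately makes it easy to check that the prefill is the final message before anything goes over the wire.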

Ollama Modelfile Template

{domain} is a placeholder. Replace it with the model's domain (e.g. physics, law, python coding) before creating the model.

FROM <model_path_or_ollama_tag>

PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 8192
PARAMETER num_predict 8192
PARAMETER think false

SYSTEM """
You are a {domain} expert assistant. Answer the user's question.
"""

Temperature 0.6 is strongly recommended. Higher temperatures (≥ 0.8) materially increase loop rates in think-off mode.
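For callers that use the Ollama API directly instead of a Modelfile, the same settings can be carried in the request body. The sketch below only builds the body for Ollama's chat endpoint; the function name is hypothetical, the model tag is the `hf.co/...` tag shown earlier on this page, and the option values are copied from the template above.

```python
import json


def build_ollama_chat_body(model, domain, question):
    """Request body for Ollama's chat endpoint with thinking disabled."""
    return {
        "model": model,
        "think": False,   # think-off mode, as required for this series
        "stream": False,  # return a single JSON response instead of a stream
        "messages": [
            {
                "role": "system",
                "content": f"You are a {domain} expert assistant. Answer the user's question.",
            },
            {"role": "user", "content": question},
        ],
        "options": {
            "temperature": 0.6,
            "repeat_penalty": 1.05,
            "num_ctx": 8192,
            "num_predict": 8192,
        },
    }


body = build_ollama_chat_body(
    "hf.co/JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m:Q4_K_M",
    "physics",
    "State the work-energy theorem.",
)
print(json.dumps(body, indent=2))
```

The body would typically be POSTed to `http://localhost:11434/api/chat` on a default local install; that address is an assumption about the local setup.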

What Are These Models?

These models are produced by activation-directed expert surgery applied to the Qwen3.6-35B-A3B base. The surgery does not change any weight values: it removes the routed-expert FFN weight tensors that are not part of the domain-specialist mask, then saves the result as a smaller GGUF. No post-surgery fine-tuning or training was done. No specific effort was made to either preserve or remove the parent model's native vision/image input capabilities. Cursory testing confirms that image input still works, but the extent of the retained vision ability has not been measured.

For each release candidate, the mask is built in three stages that draw on two distinct data signals, each collected in its own forward pass through the parent model.

Stage 1 — Per-entry fingerprints and the K95 representation space. A per-domain corpus is profiled by running each entry through the parent model with router hooks attached to every MoE layer. For each token, the router selects the top-8 experts and assigns softmax weights; the per-entry fingerprint is the per-layer activation vector (40, 256), sum-normalized per entry so that long entries do not dominate the aggregate. All 26 v2 domains' fingerprints are pooled and decomposed by uncentered SVD to produce the representation space combined_v2_basis_svd_raw_norm.pt at K95 retention (40 layers, K95 dims per layer, total 2,956 dimensions; per-layer range 41–113, mean 74). This is the space in which each expert has a unique point and in which the domain centroid lives.
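A minimal pure-Python sketch of the per-entry fingerprint follows. It assumes the fingerprint accumulates the softmax weights of the selected experts and that normalization is applied per layer (the pipeline summary describes it as "per entry per layer"); the input layout and function name are hypothetical, and the toy sizes (2 layers, 4 experts, top-2) stand in for the real (40, 256) with top-8. The SVD step is not shown.

```python
def entry_fingerprint(token_routes, n_layers, n_experts):
    """Accumulate router weights for one corpus entry, then sum-normalize each layer.

    token_routes: one item per token; each item holds, per layer, a list of
    (expert_id, softmax_weight) pairs for the experts the router selected.
    """
    fingerprint = [[0.0] * n_experts for _ in range(n_layers)]
    for token in token_routes:
        for layer, picks in enumerate(token):
            for expert, weight in picks:
                fingerprint[layer][expert] += weight
    for layer in range(n_layers):
        total = sum(fingerprint[layer])
        if total > 0:
            # Dividing by the layer total removes the effect of entry length.
            fingerprint[layer] = [v / total for v in fingerprint[layer]]
    return fingerprint


toy_entry = [
    [[(0, 0.75), (1, 0.25)], [(2, 0.5), (3, 0.5)]],  # token 1: layer 0, layer 1
    [[(0, 0.25), (3, 0.75)], [(2, 0.5), (0, 0.5)]],  # token 2: layer 0, layer 1
]
fp = entry_fingerprint(toy_entry, n_layers=2, n_experts=4)
```

Each row of `fp` sums to 1, so a 2-token entry and a 2,000-token entry contribute on the same scale when fingerprints are pooled.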

Stage 2 — K=128 cosine mask and the backbone. For each candidate model domain, the entries that make up that domain's corpus are pooled (a working group like physical_sciences pools the chemistry + physics + engineering entries; the per-source centroids of an abandoned combined model would differ from the per-source centroids of the kept separate models, which is why the combined-domain attempt was abandoned). The per-layer domain centroid μ is the entry-mean of the pooled fingerprints. Projecting μ and each expert's column of Vt[:K95] into the K95-SVD space gives a 256-element ranking of all experts per layer by cosine similarity of the expert's direction to the centroid's direction — i.e., semantic closeness to the domain. The top-128 by this cosine ranking per layer is the K=128 cosine mask for that domain.

The backbone is the per-layer intersection of the K=128 cosine masks across all 21 (v1) or 26 (v2) domains — the experts that land in the top-128 of every domain. These are the universally co-activated experts and are locked during the R-list walk so they cannot be expelled by the admixture.
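The ranking and the backbone intersection can be illustrated for a single layer with toy vectors. This sketch assumes the centroid and the expert directions have already been expressed in the same space; the 2-D vectors, the domain names used as dictionary keys, and the function names are made up for the example, and k=2 of 4 experts stands in for 128 of 256.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_mask(centroid, expert_vectors, k):
    """Expert ids of the k experts closest in direction to the centroid, best first."""
    ranked = sorted(
        range(len(expert_vectors)),
        key=lambda e: cosine(centroid, expert_vectors[e]),
        reverse=True,
    )
    return ranked[:k]


def backbone(masks_by_domain):
    """Experts that appear in every domain's mask for this layer."""
    return set.intersection(*(set(mask) for mask in masks_by_domain.values()))


experts = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 0.0)]
masks = {
    "physics": cosine_mask((1.0, 0.2), experts, k=2),
    "law": cosine_mask((0.2, 1.0), experts, k=2),
}
shared = backbone(masks)
```

In this toy layer, expert 2 lies between the two centroids and lands in both masks, which is the sense in which backbone experts are "universally co-activated".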

Stage 3 — R-list admixture (the r-percent step). Independently of Stage 1, a second forward pass through the parent model collects the per-domain per-token 3D histogram (40, 256, 8) of router selections and softmax-weight sums. From this we compute, for each (layer, expert):

R(l, e) = wm0  =  weight_sum[l, e, 0] / max(hist[l, e, 0], 1)

with R(l, e) = 0 if hist[l, e, 0] < 5 (the min_count=5 filter suppresses experts that have been the top-1 pick fewer than 5 times). R is a mean softmax weight — a router-commitment signal: how strongly the top-8 router commits to expert e on the occasions it picks e as the top-1. Specialists (rare-but-decisive picks with high softmax mass) outrank workhorses (frequent low-mass picks) on this signal even when the latter have higher raw selection counts.
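The formula and its min_count filter translate directly into a few lines of Python. The sketch takes only the rank-0 slices of the two histograms (counts and softmax-weight sums) as nested lists; the function name and the toy numbers are illustrative.

```python
def commitment_scores(hist_rank0, weight_sum_rank0, min_count=5):
    """R(l, e): mean softmax weight when expert e is the top-1 pick in layer l.

    Experts picked as top-1 fewer than min_count times get a score of 0.
    """
    scores = []
    for counts, sums in zip(hist_rank0, weight_sum_rank0):
        scores.append([
            s / max(c, 1) if c >= min_count else 0.0
            for c, s in zip(counts, sums)
        ])
    return scores


# One toy layer with three experts: a rare specialist, an under-observed
# expert, and a frequently picked workhorse.
hist = [[10, 4, 100]]
wsum = [[6.0, 3.6, 25.0]]
R = commitment_scores(hist, wsum)
```

The specialist (10 picks, mean weight 0.6) outranks the workhorse (100 picks, mean weight 0.25), and the expert with only 4 top-1 picks is zeroed by the filter.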

The R-list is the experts with the highest R, walked in descending order. Starting from the K=128 cosine mask, the top r-budget R-list experts (not already in the mask and not in the backbone) are injected by evicting the lowest-cosine non-backbone experts one-for-one. The result is a (40, 128) int16 mask file.
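The one-for-one swap can be sketched for a single layer as below. The text does not spell out how the swap budget is derived from r, so the sketch takes `budget` as an explicit integer; skipping experts whose R was zeroed by the min_count filter is also an assumption, as are the function name and the toy inputs.

```python
def admix_r_list(cosine_ranked_mask, backbone, r_scores, budget):
    """Swap up to `budget` high-R experts into a cosine mask for one layer.

    cosine_ranked_mask: expert ids in the mask, highest cosine first.
    backbone: set of expert ids that may never be evicted.
    r_scores: R value per expert id (list indexed by expert id).
    """
    mask = list(cosine_ranked_mask)
    in_mask = set(mask)
    # Eviction order: non-backbone members, lowest cosine first.
    evictable = [e for e in reversed(mask) if e not in backbone]
    # Injection order: highest R first, skipping experts that are already kept.
    r_list = sorted(range(len(r_scores)), key=lambda e: r_scores[e], reverse=True)
    incoming = [
        e for e in r_list
        if e not in in_mask and e not in backbone and r_scores[e] > 0
    ]
    for new, old in zip(incoming[:budget], evictable):
        mask[mask.index(old)] = new
    return sorted(mask)


r_scores = [0.1, 0.9, 0.3, 0.8, 0.0, 0.2, 0.7, 0.4]
new_mask = admix_r_list([2, 0, 5, 7], backbone={2}, r_scores=r_scores, budget=2)
```

Here experts 1 and 3 (the two highest R values outside the mask) replace experts 7 and 5 (the two lowest-cosine non-backbone members), while backbone expert 2 is untouched and the mask size stays at 4.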

Note on data sources. The fingerprints (Stage 1 input) and the 3D histograms (Stage 3 input) come from two independent forward passes through the parent model. The fingerprints are not derivable from the histograms: the fingerprints are per-entry sum-normalized and preserve the per-entry routing pattern, while the histograms are per-domain per-token aggregates that discard entry boundaries. Each data source carries information the other does not.

A GGUF is built by slicing the parent GGUF's routed-expert weight tensors to keep only the K=128 selected experts per layer. Attention, router, shared expert, embeddings, output head, and norms are all preserved at full count. The result is a smaller GGUF with the same per-token activation count as the parent (TOP_K=8 routed + 1 shared = 9 experts fire per token, regardless of pool size).

Memory Efficiency

The parent GGUF is 24 GB at q4_K_M; each specialist is 13 GB at q4_K_M — a 46% disk-footprint reduction. Using the same 13/24 disk-footprint ratio to scale the parent's 35B total parameter count gives the specialists 19B total parameters, while the active parameter count stays at 3B (TOP_K=8 + 1 shared, unchanged from the parent). Throughput (tokens/second) is identical between the specialist and the parent at the same quantization because the number of expert weight tensors that participate in each forward pass is the same. The saving is purely in VRAM residency — half the routed expert weight tensors simply do not need to be loaded.

| Model | Disk (q4_K_M) | Total params | Active params | Reduction vs parent |
|---|---|---|---|---|
| Parent `Qwen3.6-35B-A3B` | 24 GB | 35B | 3B | - |
| All CoE release candidates | 13 GB | 19B | 3B | 46% |
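The two derived figures in this table follow from the disk sizes alone. As a quick check, using only numbers stated above:

```python
parent_gb, coe_gb = 24, 13   # q4_K_M disk sizes from the table
parent_total_b = 35          # parent total parameters, in billions

ratio = coe_gb / parent_gb                   # about 0.54
reduction_pct = round((1 - ratio) * 100)     # disk-footprint reduction, in percent
coe_total_b = round(parent_total_b * ratio)  # scaled total parameters, in billions

print(reduction_pct, coe_total_b)
```

Note that the 19B figure is a scaling estimate derived from the disk ratio, as the text says, not a direct parameter count.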

With a modest context of under 8k tokens, a specialist runs in 16 GB of VRAM, provided no other workloads are occupying VRAM.

All figures were measured directly in Ollama.

Release Set (2026-06-16)

Seven domain-specialist CoE models derived from Qwen/Qwen3.6-35B-A3B. All non-coding domains are benchmarked on MMLU-Pro off1 (the standard held-out split, approximately 100 QIDs per domain depending on the granular subject). The Python and web coding models are benchmarked on HumanEval and a custom web coding benchmark, respectively. The CoE result is compared against the parent on the same QID set.

| Domain | CoE (HF repo) | CoE (ollama tag) | K | r | GGUF | n | CoE acc% | Parent acc% ‡ | Gap |
|---|---|---|---|---|---|---|---|---|---|
| physics | `coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m` | `coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b:q4_k_m` | 128 | 31 | 12.28 GB | 130 | 86.92% | 86.92% | +0.00 |
| chemistry | `coe-qwen3.6-mmlu_pro-chemistry-r31-19b-a3b-q4_k_m` | `coe-qwen3.6-mmlu_pro-chemistry-r31-19b-a3b:q4_k_m` | 128 | 31 | 12.5 GB | 114 | 89.47% | 78.95% / 85.09% | +10.52 |
| math | `coe-qwen3.6-mmlu_pro-math-r31-19b-a3b-q4_k_m` | `coe-qwen3.6-mmlu_pro-math-r31-19b-a3b:q4_k_m` | 128 | 31 | 13.19 GB | 135 | 88.15% | 91.85% | −3.70 |
| law | `coe-qwen3.6-mmlu_pro-law-r25-19b-a3b-q4_k_m` | `coe-qwen3.6-mmlu_pro-law-r25-19b-a3b:q4_k_m` | 128 | 25 | 13.19 GB | 110 | 67.27% | 68.18% | −0.91 |
| engineering | `coe-qwen3.6-mmlu_pro-engineering-r25-19b-a3b-q4_k_m` | `coe-qwen3.6-mmlu_pro-engineering-r25-19b-a3b:q4_k_m` | 128 | 25 | 13.19 GB | 97 | 74.23% | 64.95% / 81.44% | +9.28 |
| coding (HumanEval) | `coe-qwen3.6-hc-coding_python-r25-19b-a3b-q4_k_m` | `coe-qwen3.6-hc-coding_python-r25-19b-a3b:q4_k_m` | 128 | 25 | 12.28 GB | 164 | 88.11% (2-seed mean) | 90.55% (2-seed mean) | −2.44 |
| coding (web) | `coe-qwen3.6-hc-coding_web-r25-19b-a3b-q4_k_m` | `coe-qwen3.6-hc-coding_web-r25-19b-a3b:q4_k_m` | 128 | 25 | 13 GB | 144 | 73.61% | 67.36% | +6.25 |

‡ Parent accuracy is reported at two inference settings for engineering and chemistry (the only two domains re-benched at full compute):

8k ctx / 8k predict / think-off (the canonical release harness — the first number shown, 64.95% / 78.95%) — what the CoE numbers are also measured against, so the Gap column is directly comparable.
32k ctx / 24k predict / think-on (the "max-compute" parent configuration — the second number shown, 81.44% / 85.09%) — the upper bound on parent accuracy at the most generous inference settings tested in this study (see Parent at Full Compute below).

On both engineering and chemistry the CoE wins on accuracy at the canonical harness. On engineering, the parent at full compute closes the gap and overtakes the r=25 CoE by 7.21 pp (81.44% vs 74.23%); on chemistry, the CoE r=31 holds the lead against the parent at every tested compute configuration (89.47% vs the parent's 85.09% full-compute number).

Two MMLU-Pro CoE models beat the parent at the canonical harness: chemistry (+10.52 pp) and engineering (+9.28 pp). The chemistry gain is the largest positive gap of any release domain. Three CoE models tie or slightly trail the parent at the canonical harness: physics (tie), law (−0.91 pp), math (−3.70 pp). The two coding specialists show mixed results: coding_python is slightly behind the parent on HumanEval (−2.44 pp, 2-seed mean), while coding_web beats the parent by +6.25 pp on the Web Visual Generation Suite.

The two CoE models that beat the parent also have a key operational advantage: they are ~46% smaller on disk, and at 8k / think-off the CoE is faster than the parent at any tested configuration (the parent's full-24k think-on mode takes ~3× longer per QID than the CoE's 8k think-off). See Pass@k Protocol below for the practical implication when the user does not have a reference answer.

Naming Convention

Each release candidate is a separate HuggingFace repo. The naming convention is coe-qwen3.6-{corpus}-{domain}-r{NN}-{size}-{active}-{quant}:

coe-qwen3.6- — College of Experts family, base model Qwen3.6
{corpus} — mmlu_pro (profiled on MMLU-Pro stratified subsets) for the 5 academic domains, or hc (hand-curated corpus, e.g. HumanEval / LCB-derived for coding) for the 2 coding specialists
{domain} — granular MMLU-Pro subject (or coding_python / coding_web for the coding specialists)
r{NN} — R-list admixture fraction. The optimal r is a per-domain hyperparameter: r=31 for the dense-signal STEM domains (physics, chemistry, math), r=25 for the flatter-signal domains (law, engineering, and the two coding specialists). See Cross-Domain r-Percent Findings below for the empirical basis.
{size}-{active} — total and active parameter counts in billions. All 7 release candidates are 19B total / 3B active (K=128 experts/layer out of 256, the standard 50% K-budget reduction; the bb128 token used in earlier internal tags is from a deprecated experimental backbone-density sweep and is not part of the release naming convention).
{quant} — q4_K_M quantization

The local ollama registry uses the same names with :q4_k_m instead of -q4_k_m (ollama syntax), and the bb128 segment is preserved in the internal ollama tag for traceability to the build script's CLI.

Per-Domain Benchmark Details

MMLU-Pro domains (5 release candidates)

Evaluation harness: an in-house MMLU-Pro pass@1 benchmark runner.

Multi-model × multi-domain support
Domain-appropriate system prompt (mathematician, lawyer, etc.)
Single-attempt, no retry chain except on loop detection
Loop detection: 3+ repeats triggers a single retry with a new seed

Settings: temperature 0.6, top_k 64, top_p 0.95, repeat_penalty 1.05, num_ctx 8192, num_predict 8192, seed 42 (fixed), think-OFF (assistant prefill <think></think>\n).
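The loop-detection rule above ("3+ repeats triggers a single retry with a new seed") can be sketched as follows. The actual detector in the in-house runner is not described, so this version makes its own assumptions: a loop is a trailing chunk of at least 20 characters repeated three times back to back, and the retry seed 7 is chosen arbitrarily. The `generate` callable is a hypothetical stand-in for a model call.

```python
def has_loop(text, min_chunk=20, repeats=3):
    """True if the text ends with one chunk (>= min_chunk chars) repeated `repeats` times."""
    max_chunk = len(text) // repeats
    for size in range(min_chunk, max_chunk + 1):
        chunk = text[-size:]
        if text.endswith(chunk * repeats):
            return True
    return False


def answer_with_loop_guard(generate, prompt, seed=42, retry_seed=7):
    """Single attempt, plus exactly one retry with a new seed if a loop is detected."""
    output = generate(prompt, seed)
    if has_loop(output):
        output = generate(prompt, retry_seed)
    return output


looping = "Answer: B. " + "Let me reconsider the problem. " * 3


def fake_generate(prompt, seed):
    # Stand-in model: loops on the first seed, answers cleanly on any other.
    return looping if seed == 42 else "The answer is (B)."


result = answer_with_loop_guard(fake_generate, "toy question")
```

A character-level tail check is cheap enough to run on streamed output; a token-level or n-gram detector would be a reasonable alternative.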

Coding domains (2 release candidates)

coding_python (HumanEval):

164 HumanEval problems, 2-seed mean accuracy (seeds 42 and 38762)
Temperature 0.6, top_k 64, top_p 0.95, repeat_penalty 1.05
CoE mean: 88.11% (88.41% seed=42, 87.80% seed=38762)
Parent mean: 90.55% (90.24% seed=42, 90.85% seed=38762)
Gap: −2.44 pp

coding_web (Web Visual Generation Suite):

144 prompts sampled from a 1-in-5 subset of the web coding corpus
DOM rendering + visual inspection + functional assertions
Dual-pass with live streaming, loop trap detection, multi-turn seed fallback
Settings: as for coding_python above
CoE: 73.61% (6 loop traps, 100% recovery rate via loop guard)
Parent: 67.36% (0 loop traps)
Gap: +6.25 pp

Cross-Domain r-Percent Findings

The r-percent R-list admixture fraction is a per-domain hyperparameter. Across the 7 release domains, the optimal r consistently falls in the 25 to 31 percent range; the peak is definite but not particularly sharp. The grouping below reflects the empirical pattern observed in the per-domain r-sweeps rather than an a-priori taxonomy.

| Domain category | Preferred r | Empirical pattern |
|---|---|---|
| physics, chemistry, math | r=31 | dense R-signal: high-wm0 specialists dominate; more R-list admixture is monotonically better through r=31 |
| engineering | r=25 | medium R-signal; r=25 > r=31 on full 97 QIDs (engineering is a STEM domain in MMLU-Pro taxonomy but R-signal density tracks the per-domain corpus, not the field label) |
| law | r=25 | flatter R-signal; r=25 is a local maximum, r=22 ties r=0 cosine, r=19 falls below r=0 |
| python, web | r=25 | flatter R-signal in the LCB-derived python corpus; r=25 > r=28 on full 164 HumanEval tasks |

Parent at Full Compute

To verify that the engineering r=25 win and the chemistry r=31 lead are robust — and not artefacts of the canonical 8k-ctx think-off harness handicapping the parent — the parent was re-benched on engineering off1 (97 QIDs) and chemistry off1 (114 QIDs) at the most generous inference settings tested in this study:

| Knob | Value |
|---|---|
| `num_ctx` | 32768 (32k) |
| `num_predict` | 24576 (24k) |
| `think` | true (native thinking) |
| assistant prefill | none (think block open) |
| temperature, top_k, top_p, repeat_penalty, seed | same as canonical (0.6, 64, 0.95, 1.05, 42) |

These settings were applied by monkey-patching the standard bench runner after import.

Results (single-shot, no pass@k)

| Run | n | corr | noans | wrong | acc% | wall time | mean sec/q |
|---|---|---|---|---|---|---|---|
| Parent eng 32k ctx / 24k predict / think-on | 97 | 79 | 6 | 12 | 81.44% | 255 min | 157.7 s |
| Parent chem 32k ctx / 24k predict / think-on | 114 | 97 | 13 | 4 | 85.09% | 327 min | 172.1 s |

Both numbers are substantially higher than the parent's canonical 8k-ctx / 8k-predict / think-off numbers (engineering: +16.49 pp, chemistry: +6.14 pp). The previous lower results in the literature on the parent model were a harness artifact — at the canonical 8k / 8k / think-off config, the parent's think block is also clipped at 8k tokens, and that clipping produces no-answers on hard questions that the model would have answered with more room.

Cross-domain comparison: CoE release vs Parent at full compute

| Domain | CoE (8k think-off) | Parent 8k think-off | Parent 32k/24k think-on | Best |
|---|---|---|---|---|
| engineering | 74.23% | 64.95% | 81.44% | Parent (full compute) by +7.21 pp |
| chemistry | 89.47% | 78.95% | 85.09% | CoE r=31 by +4.38 pp |

The CoE r=25 engineering release does not catch the parent at full compute; on chemistry, the CoE r=31 release still beats the parent at full compute by +4.38 pp (89.47% vs 85.09%). The CoE is also dramatically cheaper at inference — the parent's full-compute run takes 3.1× longer per QID than the CoE's 8k think-off run on engineering (158s vs 51s) and 2.5× longer on chemistry (172s vs 70s).

Practical reading. The CoE is the right "production default" on chemistry at 8k / think-off: 89.47% accuracy, 70 sec/QID, 13 GB. The parent at full compute is the right "maximum accuracy" mode when wall time is not the binding constraint — useful for benchmark reproduction or one-off deep analyses. The two are not in competition: they serve different operating points on the latency/accuracy trade-off.

Why the canonical 8k harness was handicapping the parent

The parent's 8k / 8k / think-off run produced 16 no-answers on chemistry (78.95% accuracy) and 27 no-answers on engineering (64.95% accuracy). A targeted rerun on the 16 chemistry no-answer QIDs at 24k predict recovered 11 of them, but did not retroactively change the 78.95% — those recovered answers were already counted in the 90 / 114 = 78.95% via the cumulative additive interpretation (see the ‡ footnote on the release table). The full-domain rerun at 32k / 24k / think-on confirms this: 97/114 = 85.09% on chemistry, +6.14 pp over canonical, with 13 no-answers instead of 16.

The bound on the parent at any tested compute configuration is therefore ~85% on chemistry, ~81% on engineering. The CoE at 8k / think-off matches or exceeds this on chemistry (89.47%) and trails on engineering (74.23% vs 81.44%).

Pass@k Protocol

For the engineering r=25 release, a pass@k sweep was run on the same 97 QIDs with the canonical 8k / think-off harness, retrying failed QIDs with fresh seeds from the schedule [42, 7, 13, 99, 17, 31, 57, 89, 123, 251]. Each QID is re-rolled only on failure (i.e. the runner exits early on the first correct attempt — the protocol the runner was designed for).

The pass@1 row of this table is the canonical engineering pass@1 number (72/97 = 74.23%) and is reused from the prior single-shot run; the runner only does the rounds 2..k for the unsolved QIDs. This avoids wasting ~99 min of compute re-running already-solved QIDs at attempt 1. Total wall-time budget: 255 min (matching the parent's full-compute single attempt engineering run, for like-for-like comparison).
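The retry protocol described above (reuse round 1, then re-roll only the unsolved QIDs with the next seed in the schedule) can be sketched as a small loop. The `attempt` callable, the function name, and the toy QIDs are hypothetical, and the wall-time budget cutoff is deliberately left out.

```python
SEED_SCHEDULE = [42, 7, 13, 99, 17, 31, 57, 89, 123, 251]


def run_pass_at_k(qids, attempt, seeds, solved_round1=None):
    """Return the cumulative solved count after each round.

    attempt(qid, seed) -> bool runs one attempt. If solved_round1 is given,
    round 1 is reused from a prior run and only rounds 2..k are executed.
    """
    solved = set(solved_round1) if solved_round1 is not None else set()
    curve = []
    first_round = 1
    if solved_round1 is not None:
        curve.append(len(solved))
        first_round = 2
    for k in range(first_round, len(seeds) + 1):
        unsolved = [q for q in qids if q not in solved]
        if not unsolved:
            break  # everything is solved; later rounds would be empty
        for qid in unsolved:
            if attempt(qid, seeds[k - 1]):
                solved.add(qid)
        curve.append(len(solved))
    return curve


def toy_attempt(qid, seed):
    # Stand-in for a model call: QID 3 is solved by seed 7, QID 4 by seed 99.
    return (qid == 3 and seed == 7) or (qid == 4 and seed == 99)


curve = run_pass_at_k([1, 2, 3, 4], toy_attempt, SEED_SCHEDULE[:4], solved_round1={1, 2})
```

Because a solved QID is never re-run, each round gets cheaper as the unsolved set shrinks, which is the pattern visible in the n_run column of the table that follows.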

Pass@k curve (corrected protocol — round 1 reused from prior run)

| k | n_run | solved / n | pass@k | Δ from pass@(k-1) | round wall (min) | cum wall (min) |
|---|---|---|---|---|---|---|
| 1 | 97 | 72/97 | 74.23% | - | 104.2 | 104.2 |
| 2 | 25 | 79/97 | 81.44% | +7.21 pp | 39.7 | 143.9 |
| 3 | 18 | 83/97 | 85.57% | +4.13 pp | 27.6 | 171.6 |
| 4 | 14 | 86/97 | 88.66% | +3.09 pp | 22.8 | 194.4 |
| 5 | 11 | 88/97 | 90.72% | +2.06 pp | 21.2 | 215.5 |
| 6 | 9 | 90/97 | 92.78% | +2.06 pp | 16.8 | 232.3 |
| 7 | 6 | 91/97 | 93.81% | +1.03 pp | 10.5 | 242.9 |
| 8-10 | (budget exhausted before next round) | 91/97 | 93.81% | 0 | - | - |

Time to reach pass@2: 143.9 min total (104.2 min for round 1 + 39.7 min for round 2 on the 25 unsolved QIDs). Average wall time per QID at pass@2: 89.0 s.

Time to reach plateau (pass@7 = pass@10 = 91/97): 242.9 min total (round 7 added 1 more QID at 10.5 min; rounds 8+ were cut by the 255 min budget). Beyond pass@7, no further QIDs are solvable on the remaining budget.

The 6 of the 97 QIDs that are "sticky" (never solved across all 1–7 attempts they were given) are: 11306, 11397, 11417, 11788, 11818, 12039. Of these, 11306, 11397, 11788 returned no-answer on most seeds; the other 3 produced wrong answers consistently.

Operating-point guidance for r=25. The cost-per-pp starts at ~5.5 min/pp (pass@1→2) and roughly doubles by pass@5 (~10.3 min/pp). For a single-shot answer, pass@1 = 74.23% is the baseline. For a self-consistent 2-of-3 protocol, pass@2 = 81.44% at 39.7 min of extra cost is the recommended operating point: the marginal cost between pass@2 and pass@7 is 99.0 min for +12.37 pp, which is rarely worth it for production traffic. The 99.0-min / 12.37-pp ratio (8.0 min/pp) is the average cost of squeezing the last accuracy out of pass@k.
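The cost-per-pp figures quoted above can be reproduced from the solved counts and per-round wall times in the pass@k table. The function name and row layout below are illustrative only.

```python
N_QIDS = 97
solved = [72, 79, 83, 86, 88, 90, 91]                       # cumulative, rounds 1..7
round_wall_min = [104.2, 39.7, 27.6, 22.8, 21.2, 16.8, 10.5]


def pass_at_k_rows(solved, round_wall_min, n):
    """Per-round pass@k, cumulative wall time, and marginal minutes per pp gained."""
    rows = []
    cumulative = 0.0
    for k, (s, wall) in enumerate(zip(solved, round_wall_min), start=1):
        cumulative += wall
        gained_pp = (s - solved[k - 2]) / n * 100 if k > 1 else None
        rows.append({
            "k": k,
            "pass_at_k": round(s / n * 100, 2),
            "cum_wall_min": round(cumulative, 1),
            "min_per_pp": round(wall / gained_pp, 1) if gained_pp else None,
        })
    return rows


rows = pass_at_k_rows(solved, round_wall_min, N_QIDS)
for row in rows:
    print(row)
```

Cumulative times computed this way can differ from the table by 0.1 min in later rounds, because the table's per-round times are themselves rounded.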

CoE r=25 + pass@k vs Parent single-shot (engineering off1, 97 QIDs)

| Configuration | Time/QID | Accuracy | acc/min |
|---|---|---|---|
| CoE r=25 single-shot (8k / think-off), pass@1 | 64.5 s | 74.23% | 1.151 |
| CoE r=25 + pass@2 | 89.0 s | 81.44% | 0.549 |
| CoE r=25 + pass@7 plateau | 160.6 s | 93.81% | 0.386 |
| Parent full-24k (32k / 24k / think-on) | 158 s | 81.44% | 0.310 |

(Time/QID for the CoE rows is the per-QID mean across all attempted runs, including sticky QIDs that consumed the full attempt budget.)

The CoE r=25 with pass@2 hits 81.44% at 89.0 s/QID, tying the parent's full-compute 81.44% at 158 s/QID in 0.56× the wall time. The plateau number 93.81% is the highest accuracy the r=25 model reached before the 255 min budget ran out (242.9 min used); the practical operating point is 81.44% at pass@2.

For real-world deployment without a reference answer, a "2-of-3 self-consistent" protocol on the CoE r=25 (run twice; if the two answers disagree, tag as "low confidence" and either escalate or re-ask) gives ~91% confidence on the served answer at 89.0 s/QID (the per-QID cost at pass@2 — the median of the two runs). See the CoE agent cascade writeup for the full privacy-and-cost analysis.
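The run-twice agreement check described above can be sketched as below. The `ask` callable is a hypothetical stand-in for a model call that returns a parsed answer, the seeds 42 and 7 are taken from the pass@k schedule as an arbitrary choice, and the returned dictionary layout is an assumption.

```python
def serve_with_agreement_check(ask, question, seeds=(42, 7)):
    """Ask twice with different seeds; serve the answer only if both runs agree."""
    first = ask(question, seeds[0])
    second = ask(question, seeds[1])
    if first == second:
        return {"answer": first, "confidence": "normal"}
    # Disagreement: flag for escalation or a re-ask instead of guessing.
    return {"answer": None, "confidence": "low", "candidates": [first, second]}


def stable_model(question, seed):
    return "B"


def unstable_model(question, seed):
    return "B" if seed == 42 else "C"


agreed = serve_with_agreement_check(stable_model, "toy question")
flagged = serve_with_agreement_check(unstable_model, "toy question")
```

Comparing parsed answers (for example the chosen option letter) rather than raw text keeps harmless wording differences from being flagged as disagreement.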

CoE Agent: Cascade + Privacy

The CoE agent framework (updated version planned for release) wraps the CoE models in a T0/T1/T2 cascade for real-world deployment:

T0 — Local CoE r=25 first attempt (60s, 78% accurate, free, no network)
T1 — Local retry with parser-aware hint (110s, +6% marginal accuracy, free, no network)
T2 — Online SOTA escalation with cover-traffic padding (1 real + N-1 decoys per escalation) and SOTA model rotation (round-robin across Claude / GPT / Gemini / GLM / other SOTA tier)

The cascade numbers for engineering off1: 84.5% of QIDs stay purely local and 15.5% escalate. At the standard 4× padding and 3-way rotation, the cascade costs $1.86 per 100 QIDs at 1.29% effective privacy leakage. The all-online alternative costs $30+ per 100 QIDs at 100% leakage, using approximate API fees for the next-to-latest tier of proprietary models (i.e. 2 million output tokens at $15 per million).
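One reading that reproduces the 1.29% figure is that an escalated query is diluted by the padding factor and then by the rotation factor. That formula is an assumption made for this sketch, not something the text states; the $30 baseline uses the token count and price quoted above.

```python
escalation_rate = 0.155   # share of QIDs that reach T2
padding = 4               # 1 real query + 3 decoys per escalation
rotation = 3              # providers in the round-robin

# Assumed model: leakage = escalation rate / (padding * rotation).
leakage_pct = round(escalation_rate / (padding * rotation) * 100, 2)

# All-online baseline: 2 million output tokens at $15 per million.
all_online_cost_usd = 2 * 15

print(leakage_pct, all_online_cost_usd)
```

The $1.86 cascade cost depends on per-provider pricing and decoy sizes that are not given here, so it is not recomputed.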

Construction Pipeline

For each release candidate, the build follows the pipeline below. Steps 1 and 2 are two separate forward passes through the parent model — they are not collected simultaneously and produce distinct data products that are not derivable from one another:

1. Forward pass #1: profile histograms. The domain corpus is run through the parent model with router hooks on every MoE layer. The result is a per-domain 3D activation histogram of shape (40, 256, 8) over (layer, expert, rank), plus a matching (40, 256, 8) tensor of softmax-weight sums. This is the only input to the R-list (Step 4).
2. Forward pass #2: entry fingerprints. All domains/categories are profiled again, this time recording activation per entry (sum-normalized per entry per layer). This is not derivable from the histograms: the fingerprints preserve per-entry routing patterns that the per-domain per-token histogram aggregation discards. The fingerprints are pooled across all 26 v2 domains and decomposed by uncentered SVD to build the K95 representation space (combined_v2_basis_svd_raw_norm.pt). This space is the input to the cosine K=128 mask (Step 3).
3. Build a K=128 cosine mask by SVDv2 cosine similarity to the domain centroid in K95 space, intersected with a backbone of experts that survive across all profiled domains. The cosine is computed between the direction of the per-layer domain centroid (projected from the pooled fingerprint entry-mean) and the direction of each expert's column of Vt[:K95]; the top-128 by cosine per layer is the K=128 cosine mask.
4. Build an r-percent mask starting from the K=128 cosine mask, removing the backbone to obtain a swap-eligible set, and admixing the top R-budget experts (ranked by the R / wm0 signal, i.e. mean softmax weight at rank-0, with a min_count=5 filter to suppress unobserved experts) at fraction r ∈ {0.25, 0.31}. The result is a (40, 128) int16 mask file.
5. Build the GGUF by slicing the parent GGUF's routed-expert FFN weight tensors to keep only the K=128 selected experts per layer. Attention, router, shared expert, embeddings, output head, and norms are all preserved at full count. The metadata patch updates the expert count from 256 to 128.
6. Benchmark the new model on MMLU-Pro off1 (or HumanEval / Web Visual for coding domains) at temperature 0.6, num_ctx 8192, num_predict 8192, think-OFF, seed 42.
7. Validate the r-percent choice by sweeping r ∈ {0.25, 0.28, 0.31, 0.34} on the new domain; promote the best-performing variant to release.

The full technical documentation — including per-step recipes, pitfalls, and the empirical data behind the r-percent sweep — is planned for release in the near future at https://github.com/JThomas-CoE/College-of-Experts-AI

Citation / Attribution

Research and engineering by JThomas-CoE.

Base model: Qwen/Qwen3.6-35B-A3B by the Qwen team. All specialist weights are derived from the publicly released checkpoint. Usage of the base model is subject to the Apache License, Version 2.0.

The project also publishes a gemma4-based CoE series under the same general approach.

License

Model weights: Apache 2.0 (inherited from Qwen/Qwen3.6-35B-A3B). Code and tooling: PolyForm Noncommercial 1.0.0. Commercial licensing: see LICENSE-COMMERCIAL.md.

GGUF

Model size: 19B params
Architecture: qwen35moe
Hardware compatibility: 4-bit

Model tree for JThomas-CoE/coe-qwen3.6-mmlu_pro-physics-r31-19b-a3b-q4_k_m

Base model: Qwen/Qwen3.6-35B-A3B