# Implementation Plan: acting on the 9-round SOTA research (June 2026)

**Basis:** `research/swappable_adapters_sota.md` (9 deep rounds). Rounds 5–9 mostly *confirmed* our builds (prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048, contamination-checking): the project is **SOTA-aligned**. This plan extracts the genuinely-new levers and sequences them around the **single GPU**.

**Sequencing constraint:** one GPU. The **soul2 verdict** (GREEN, HumanEval held 116/164, shipping via the autonomous driver) goes first; GPU-bound work queues behind it. **GPU-free code changes happen now, in parallel.**

---

## TIER 1: now (GPU-free / quick, highest value-per-effort)

1. **Overthinking fix**: a concise-CoT mode (reasoning-token budget + "think tersely, end with the answer" directive) in the serve / decoding path. **Why:** our 5–8K-token CoT *overthinks*: the 2026 literature shows that hurts accuracy + calibration, it's slow at 11–14 tok/s, and it's what broke the GSM8K answer-parser. **Test:** token-count drop + a clean final-answer extraction (no GSM8K-number-chase; just verify the parse).
2. **REAP logit-renorm fix**: renormalize the top-k router logits to sum to 1 in the prune script (the March-2026 / ICLR-2026 REAP update). **Why:** modest free accuracy gain on the next prune. Write now, runs later.
3. **CallSieve agentic-RAG upgrade** *(user-greenlit)*: expose **hierarchical retrieval as tools** (keyword + semantic + chunk-read, the A-RAG pattern) + iterative multi-hop, instead of one-shot fetch. **Look at the `/Users/pjb/git/callsieve` repo first**, then apply minimally.
4. **mlx-optiq probe**: `pip install mlx-optiq` + a load-test script (mount our q3a4 base + a rank-16 adapter, confirm per-request hot-swap). **Why:** it solves the factory's instant-swap (no 3-min reload). Validate when the GPU frees.

## TIER 2: GPU-bound (queue behind the soul2 ship)

5. **soul2 ship**: in flight (driver; GREEN verdict). The new core soul.
6. **Saliency-dynamic quant (#59)**: protect the salient + structurally-sensitive experts and **early layers at 4-bit+**, rest at 3-bit. **Why:** the design degeneration is **Computation Collapse** (round 4–5 diagnosis), *only* fixable by mixed-precision on the critical experts, not decoding tricks. **The big quality win** (recovers design).
7. **Specialty heals**: the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science / perfumery / factory-router) → adapters, using **MoE-Sieve placement** (LoRA only the top-25% routed experts + attention → 70–73% smaller, more hot-swappable) + **iw-SFT** (importance-weighted curated SFT).
8. **KV-quant** (TurboQuant/KVQuant-style): harden the serve against the 118 GB long-gen self-bound.

## TIER 3: eval & quality

9. **Real-task benches**: weight FeatureBench / Terminal-Bench / LongCLI-Bench over the **saturating** SWE-bench (#62).
10. **Contamination-resistant eval**: LiveCodeBench / LiveBench (#66); our 0/0/0.4% near-dup checking is validated.
11. **Lean-OPD** (self-teacher critiques the student's Lean attempt) for the prover (#27–31); **agentic-RAG** for live-docs; **security** (LlamaFirewall / TRUSTDESC tool-poisoning guard + a "Zombie-Agent" check on the self-healing flywheel).

---

## First actions (this session)

- **Tier-1 #1 (overthinking)** and **#3 (CallSieve)**: GPU-free, start now.
- Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
- The 2-bit BitNet family (#57–58) stays an *experiment*: BitNet needs QAT, our PTQ won't match it.
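
The Tier-1 #1 test (token-count drop plus a clean final-answer parse) could be checked with something like the sketch below. It is an illustration only, not the serve path's actual code: the directive wording, the `Final answer:` marker, the `REASONING_TOKEN_BUDGET` value, and the whitespace-based token count are all assumptions made here; a real check would use the model's tokenizer and whatever answer marker the serve path emits.

```python
import re
from typing import Optional

# Hypothetical values: neither the budget nor the marker comes from the plan.
REASONING_TOKEN_BUDGET = 1024
TERSE_DIRECTIVE = "Think tersely, then end with a line 'Final answer: <answer>'."

_FINAL_RE = re.compile(r"^Final answer:\s*(.+?)\s*$", re.MULTILINE)


def build_prompt(question: str) -> str:
    """Prepend the concise-CoT directive to the user question."""
    return f"{TERSE_DIRECTIVE}\n\n{question}"


def approx_token_count(text: str) -> int:
    """Crude whitespace token count; a stand-in for the real tokenizer."""
    return len(text.split())


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None."""
    matches = _FINAL_RE.findall(completion)
    return matches[-1] if matches else None


def check_completion(completion: str) -> dict:
    """Report whether a completion stayed in budget and parsed cleanly."""
    tokens = approx_token_count(completion)
    answer = extract_final_answer(completion)
    return {
        "tokens": tokens,
        "within_budget": tokens <= REASONING_TOKEN_BUDGET,
        "answer": answer,
        "parsed": answer is not None,
    }
```

Running `check_completion` over the same prompts before and after the concise-CoT change would give the token-count drop and the parse rate without scoring GSM8K accuracy itself.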
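
Tier-1 #2 asks for the top-k router values to be renormalized so they sum to 1. A minimal sketch of that step follows, reading it as the common softmax-then-top-k MoE gate whose selected weights are rescaled. That reading is an assumption, the function names are hypothetical, and this is not the prune script's or REAP's actual code.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over all router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_renormalized_gates(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and rescale their gate
    weights so the selected weights sum to 1."""
    if not 0 < k <= len(logits):
        raise ValueError("k must be in 1..len(logits)")
    probs = softmax(logits)
    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


# Toy router output for one token over four experts.
gates = topk_renormalized_gates([2.0, 0.5, 1.0, -1.0], k=2)
print(gates)
```

Rescaling after selection is mathematically the same as taking a softmax over only the selected logits, so either form could be used in the prune script.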