GLM-5.2-Demolition-q4a4-soul-MLX / IMPLEMENTATION_PLAN.md

Implementation Plan: acting on the 9-round SOTA research (June 2026)

Basis: research/swappable_adapters_sota.md (9 deep rounds). Rounds 5–9 mostly confirmed our builds (prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048, contamination-checking) β€” the project is SOTA-aligned. This plan extracts the genuinely-new levers and sequences them around the single GPU.

Sequencing constraint: one GPU. The soul2 verdict (GREEN β€” HumanEval held 116/164 β€” shipping via the autonomous driver) goes first; GPU-bound work queues behind it. GPU-free code changes happen now, in parallel.


TIER 1 β€” now (GPU-free / quick, highest value-per-effort)

  1. Overthinking fix β€” a concise-CoT mode (reasoning-token budget + "think tersely, end with the answer" directive) in the serve / decoding path. Why: our 5–8K-token CoT overthinks β€” the 2026 literature shows that hurts accuracy + calibration, it's slow at 11–14 tok/s, and it's what broke the GSM8K answer-parser. Test: token-count drop + a clean final-answer extraction (no GSM8K-number-chase β€” just verify the parse).
  2. REAP logit-renorm fix β€” renormalize the top-k router logits to sum to 1 in the prune script (the March-2026 / ICLR-2026 REAP update). Why: modest free accuracy gain on the next prune. Write now, runs later.
  3. CallSieve agentic-RAG upgrade (user-greenlit) β€” expose hierarchical retrieval as tools (keyword + semantic + chunk-read, the A-RAG pattern) + iterative multi-hop, instead of one-shot fetch. Look at the /Users/pjb/git/callsieve repo first, then apply minimally.
  4. mlx-optiq probe β€” pip install mlx-optiq + a load-test script (mount our q3a4 base + a rank-16 adapter, confirm per-request hot-swap). Why: it solves the factory's instant-swap (no 3-min reload). Validate when the GPU frees.
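A minimal sketch of the renormalization in Tier-1 #2, for reference when the prune script is edited. It assumes the router scores are softmaxed over all experts first and that the kept top-k gate weights are then rescaled to sum to 1; the plan does not say which order the REAP update uses, and the function name is hypothetical.

```python
import math

def topk_renormalized_gates(router_logits, k):
    """Return {expert_index: gate_weight} for the top-k experts.

    Assumption: softmax over all experts, then rescale the kept weights
    so they sum to 1. The real prune script may order these steps differently.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over every expert.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Rescale so the kept gate weights sum to 1.
    kept_mass = sum(probs[i] for i in top)
    return {i: probs[i] / kept_mass for i in top}

gates = topk_renormalized_gates([2.0, 1.0, 0.0, -1.0], k=2)
print(gates)
```

Under these assumptions the kept weights depend only on the differences between the kept logits, so the dropped experts no longer leak probability mass out of the saliency score.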

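The Tier-1 #1 test (token-count drop plus a clean final-answer parse) could be checked along these lines. This is a sketch only: the directive wording, the `Final answer:` marker, and the whitespace-based token count are all assumptions, not what the serve path currently does.

```python
import re

# Hypothetical directive appended to each prompt in concise-CoT mode.
CONCISE_DIRECTIVE = (
    "Think tersely, then end with a line of the form 'Final answer: <answer>'."
)
# Matches a line such as "Final answer: 42" (case-insensitive).
ANSWER_RE = re.compile(
    r"^[ \t]*final answer[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def build_prompt(question):
    """Attach the concise-CoT directive to a user question."""
    return f"{question}\n\n{CONCISE_DIRECTIVE}"

def extract_final_answer(completion):
    """Return the text of the last 'Final answer:' line, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None

def check_completion(completion, reasoning_budget):
    """Report a rough token count and whether the answer parsed cleanly.

    Tokens are approximated by whitespace splitting; a real check would
    use the model tokenizer.
    """
    n_tokens = len(completion.split())
    answer = extract_final_answer(completion)
    return {
        "tokens": n_tokens,
        "within_budget": n_tokens <= reasoning_budget,
        "answer": answer,
        "parsed": answer is not None,
    }

report = check_completion("6*7=42\nFinal answer: 42", reasoning_budget=50)
print(report)
```

Taking the last matching line rather than the first keeps the parse stable when the model restates an intermediate guess before settling on its answer.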
TIER 2 β€” GPU-bound (queue behind the soul2 ship)

  1. soul2 ship β€” in flight (driver; GREEN verdict). The new core soul.
  2. Saliency-dynamic quant (#59) β€” protect the salient + structurally-sensitive experts and early layers at 4-bit+, rest at 3-bit. Why: the design degeneration is Computation Collapse (round 4–5 diagnosis) β€” only fixable by mixed-precision on the critical experts, not decoding tricks. The big quality win (recovers design).
  3. Specialty heals: the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science / perfumery / factory-router) → adapters, using MoE-Sieve placement (LoRA only the top-25% routed experts + attention → 70–73% smaller, more hot-swappable) + iw-SFT (importance-weighted curated SFT).
  4. KV-quant (TurboQuant/KVQuant-style) β€” harden the serve against the 118 GB long-gen self-bound.
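The bit allocation behind Tier-2 #2 can be sketched as a simple ranking step. The saliency scores, the protected fraction, and the number of early layers are hypothetical inputs here; the plan fixes only the 4-bit / 3-bit split, not how saliency is measured or how many experts are protected.

```python
import math

def assign_bits(saliency, protect_fraction=0.25, early_layers=1,
                high_bits=4, low_bits=3):
    """Map each (layer, expert) key to a bit width.

    saliency: {(layer, expert): score}, higher means more salient.
    The top `protect_fraction` of experts by score and every expert in the
    first `early_layers` layers get `high_bits`; the rest get `low_bits`.
    """
    if not 0.0 <= protect_fraction <= 1.0:
        raise ValueError("protect_fraction must be in [0, 1]")
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    n_protect = math.ceil(protect_fraction * len(ranked))
    protected = set(ranked[:n_protect])
    return {
        key: (high_bits if key in protected or key[0] < early_layers else low_bits)
        for key in saliency
    }

def average_bits(bits):
    """Unweighted mean bit width, assuming equally sized experts."""
    return sum(bits.values()) / len(bits)

scores = {
    (0, 0): 0.10, (0, 1): 0.20,
    (3, 0): 0.90, (3, 1): 0.05,
    (4, 0): 0.30, (4, 1): 0.01,
    (5, 0): 0.02, (5, 1): 0.03,
}
plan = assign_bits(scores)
print(plan, average_bits(plan))
```

The average bit width gives a quick read on how much of the 3-bit memory budget the protected experts consume before any GPU time is spent.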

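For the MoE-Sieve placement in Tier-2 #3, picking the top-25% routed experts per layer might look like the sketch below. It assumes a routing trace of (layer, expert) pairs collected on the specialty data; the trace format and function name are hypothetical, and the attention-layer LoRA is not shown.

```python
import math
from collections import Counter

def select_lora_experts(routing_trace, num_experts, keep_fraction=0.25):
    """Choose which experts receive LoRA adapters in each layer.

    routing_trace: iterable of (layer, expert) pairs, one per routed token.
    Returns {layer: sorted expert ids}, keeping the most frequently routed
    `keep_fraction` of the `num_experts` experts in every layer seen.
    """
    counts = {}
    for layer, expert in routing_trace:
        counts.setdefault(layer, Counter())[expert] += 1
    n_keep = max(1, math.ceil(keep_fraction * num_experts))
    return {
        layer: sorted(expert for expert, _ in counter.most_common(n_keep))
        for layer, counter in counts.items()
    }

trace = [(0, 3)] * 5 + [(0, 1)] * 3 + [(0, 2)] + [(1, 0)] * 4 + [(1, 7)] * 2
print(select_lora_experts(trace, num_experts=8))
```

Selecting per layer rather than globally keeps every layer adaptable even when routing traffic is concentrated in a few layers.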
TIER 3 β€” eval & quality

  1. Real-task benches β€” weight FeatureBench / Terminal-Bench / LongCLI-Bench over the saturating SWE-bench (#62).
  2. Contamination-resistant eval β€” LiveCodeBench / LiveBench (#66); our 0/0/0.4% near-dup checking is validated.
  3. Lean-OPD (self-teacher critiques the student's Lean attempt) for the prover (#27–31); agentic-RAG for live-docs; security (LlamaFirewall / TRUSTDESC tool-poisoning guard + a "Zombie-Agent" check on the self-healing flywheel).
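A near-duplicate contamination check like the one referenced in Tier-3 #2 can be approximated with word-shingle Jaccard overlap. This is an illustrative sketch, not the project's actual checker: the shingle size, the 0.8 threshold, and the function names are assumptions.

```python
def shingles(text, n=5):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_dup_rate(train_examples, bench_items, n=5, threshold=0.8):
    """Fraction of training examples that near-duplicate any benchmark item."""
    if not train_examples:
        return 0.0
    bench_sets = [shingles(item, n) for item in bench_items]
    flagged = 0
    for example in train_examples:
        example_set = shingles(example, n)
        if any(jaccard(example_set, b) >= threshold for b in bench_sets):
            flagged += 1
    return flagged / len(train_examples)

train = [
    "def add(a, b): return a + b  # simple helper function",
    "write a poem about the sea and the sky tonight",
]
bench = ["def add(a, b): return a + b # simple helper function"]
print(near_dup_rate(train, bench))
```

The pairwise comparison is quadratic, which is fine for ~250 curated examples; a larger corpus would want MinHash or a similar index instead.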

First actions (this session)

  • Tier-1 #1 (overthinking) and #3 (CallSieve): GPU-free, start now.
  • Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
  • The 2-bit BitNet family (#57–58) stays an experiment β€” BitNet needs QAT, our PTQ won't match it.