philipjohnbasile committed on
Commit d72ec2e · verified · 1 Parent(s): 748383c

Upload research/quant_heal_speed_sota_10rounds.md with huggingface_hub

research/quant_heal_speed_sota_10rounds.md ADDED
@@ -0,0 +1,43 @@
+ # 10-Round Deep Research: how to make the demolished GLM-5.2 better (June 2026)
+
+ Goal: "constantly research how to make this better." 10 rounds, ~40 searches, every claim sourced.
+ Context: v5c (saliency affine, 6-bit heads) measured at 99 GB / **5.3 tok/s** (1.47× v4) / coherent; this is the
+ working base. The shipped model stays q3a4-v4+soul2 until a heal validates v5c.
+
+ ## The 10 rounds (one line each)
+ 1. **Bit-allocation**: KL-guided (SliM-LLM) is optimal; use an **imatrix** (⟨a²⟩ activation importance), allocate **per-tensor** (`ffn_down` protected, `ffn_up/gate` aggressive), keep **router + shared experts high-precision**.
+ 2. **Heal/recovery**: **QAD** (KL-distill teacher→student) > SFT; **SPEQ** makes the student its own teacher (no FP teacher needed); **Recover-LoRA** = 80–95 % 2-bit recovery from 10k synthetic; rank 64–128, lr 3e-4, α=2r; anti-repetition data fixes degeneration.
+ 3. **Speed**: NVFP4 gives **0 speedup in MLX** (Ollama-only); the real levers are a custom kernel or trellis serving.
+ 4. **Frontier quant**: **QTIP (trellis + incoherence)** is SOTA but CUDA-only; **rotation (SpinQuant > QuaRot)** suppresses outliers.
+ 5. **Rotation/incoherence**: Hadamard rotation **before** quant → outlier-free → fixes 3-bit collapse at the quant level; **feasible on MLX** (PolarQuant); makes the forward *faster*. **Biggest quant lever.**
+ 6. **Serving**: **PonyExl3 (EXL3 trellis on MLX)** measured **152 tok/s on a 27B / M5 Max**, MoE+LoRA+spec support; **Self-Speculative MoE = 3.72×** (our "spec dead" was naive-spec only).
+ 7. **Agentic quality**: **the SCAFFOLD matters MORE than the model** (11–50 pt swings); **Agent-RLVR = +13–18 pts SWE-bench**; small model + elite scaffold beats big model.
+ 8. **Context**: our **DSA *is* DeepSeek's** (1M ctx @ 9.62 GiB KV); **int4 KV-cache is FREE on Apple Silicon** (outruns fp16); hybrid retrieval+long-ctx (our CallSieve+DSA).
+ 9. **Reliability**: our stack (100 % constrained tool-JSON, verified decode, verifier mesh) **IS the validated SOTA**; upgrades = SCoRe self-correction RL, GenPRM, multi-verifier.
+ 10. **Compression**: REAP (ours) is SOTA (+ Mar-2026 renorm fix); order **P→KD→Q** is best; Unsloth Dynamic 2.0 beats imatrix+QAT on KL.
+
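The imatrix idea from round 1 can be sketched in a few lines. This is a minimal illustration, not the allocation used for v5c: the toy activations, the helper names `activation_importance` and `allocate_bits`, the 2/4/6-bit widths, and the "top half by importance gets more bits" rule are all assumptions made for the example. Only the importance measure itself, the mean squared activation ⟨a²⟩, comes from the notes above.

```python
from statistics import fmean


def activation_importance(activations):
    """Per-channel importance <a^2>: mean squared activation over samples.

    activations: list of samples, each a list of per-channel values.
    """
    n_channels = len(activations[0])
    return [fmean(sample[c] ** 2 for sample in activations)
            for c in range(n_channels)]


def allocate_bits(tensor_importance, protected,
                  low_bits=2, high_bits=4, protected_bits=6):
    """Toy per-tensor allocation (hypothetical rule, for illustration only).

    Protected tensors always get protected_bits. Of the remaining tensors,
    the more important half gets high_bits and the rest get low_bits.
    """
    bits = {name: protected_bits
            for name in tensor_importance if name in protected}
    rest = sorted((n for n in tensor_importance if n not in protected),
                  key=lambda n: tensor_importance[n], reverse=True)
    cutoff = len(rest) // 2
    for i, name in enumerate(rest):
        bits[name] = high_bits if i < cutoff else low_bits
    return bits


# Toy calibration activations feeding each tensor (2 samples x 2 channels).
acts = {
    "ffn_down": [[4.0, -3.0], [5.0, 2.0]],
    "ffn_up": [[0.5, 0.1], [0.2, -0.3]],
    "ffn_gate": [[0.4, 0.2], [0.1, 0.1]],
    "attn_out": [[1.0, -1.0], [2.0, 0.5]],
    "router": [[1.0, 1.0], [1.0, 1.0]],
}

# Collapse per-channel importance to one score per tensor.
importance = {name: fmean(activation_importance(a)) for name, a in acts.items()}
plan = allocate_bits(importance, protected={"router"})

for name in sorted(plan, key=plan.get, reverse=True):
    print(f"{name:9s} importance={importance[name]:.4f} bits={plan[name]}")
```

A real allocator would work against a size budget and per-tensor parameter counts rather than a fixed half-and-half split; the point here is only that the ranking is driven by measured activations instead of layer depth.
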
+ ## PRIORITIZED ACTION PLAN (impact × feasibility)
+
+ ### 🔴 Tier 1: highest impact, do first
+ 1. **Hadamard incoherence/rotation before quant** (R5): outlier-free weights → fixes the 3-bit collapse at the *quant* level + faster forward. Prototype on MLX (PolarQuant-style). *The single biggest quality lever found.*
+ 2. **QAD heal** (R2): KL-distill from the **mxfp4 teacher** (or SPEQ self-teacher) into v5c; recovers the 2-bit loss (80–95 %) and the CKA term fights collapse. Replaces plain-SFT heal. *The real quality lever (the collapse fix is the heal, not the quant; measured).*
+ 3. **imatrix + per-tensor saliency** (R1): replace depth-U with ⟨a²⟩ importance; protect `ffn_down` + router + shared experts; aggressive 2-bit on `ffn_up/gate`. Better allocation at the same size.
+ 4. **KL-eval (#60)** (R1/R10): the gold metric to *validate* every change (Unsloth's bar). Build + run on v4/v5b/v5c. *Without this we're guessing.*
+
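To make the Tier-1 rotation item concrete, here is a small stdlib-only sketch of why a Hadamard rotation helps low-bit quantization. It is a simplification under stated assumptions: a single 256-element vector of Gaussian values with one planted outlier, symmetric absmax round-to-nearest at 3 bits, and error measured on the weights alone. Real pipelines rotate weights and activations together so the layer output is unchanged; the helper names `fwht`, `quantize`, and `mse` are illustrative, not taken from PolarQuant or MLX.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def quantize(x, bits=3):
    """Symmetric absmax round-to-nearest; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1  # 3 levels each side of zero for 3-bit
    scale = max(abs(v) for v in x) / qmax
    if scale == 0.0:
        return list(x)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(256)]
w[17] = 20.0  # one outlier stretches the absmax scale for the whole vector

rotated = fwht(w)  # the outlier's energy is spread over all coordinates

err_plain = mse(w, quantize(w))                # quantize directly
err_rot = mse(w, fwht(quantize(rotated)))      # rotate, quantize, rotate back

print("absmax before rotation:", max(abs(v) for v in w))
print("absmax after rotation: ", max(abs(v) for v in rotated))
print("3-bit MSE, plain:      ", err_plain)
print("3-bit MSE, rotated:    ", err_rot)
```

Without rotation, the outlier sets the quantization step so coarse that nearly every ordinary value rounds to zero. After rotation the dynamic range is much flatter, so the same 3 bits resolve the bulk of the values.
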
+ ### 🟠 Tier 2: big wins, more effort
+ 5. **EXL3/PonyExl3 trellis serving** (R6): better quant (trellis) + fused in-kernel decode, measured fast on M5. Evaluate converting our model.
+ 6. **int4 KV-cache quant** (R8): free on Apple Silicon, 3× KV compression. Wire into the serve (supersedes #86 SSD-offload).
+ 7. **Self-Speculative MoE / layer-skip** (R6): 3.72× decode, no draft model. Re-open #69 with the MoE-specific method.
+ 8. **Long-context via DSA + YaRN** (R8): our DSA already does 1M cheaply; extend RoPE (400–600 steps). Turns our weakest axis into a strength.
+
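For the int4 KV-cache item, the sketch below shows group-wise affine 4-bit quantization of one slice of cache values, plus the storage arithmetic. It assumes a group size of 64 and one 16-bit scale and one 16-bit offset per group; those numbers are illustrative choices, not settings confirmed by the notes, so the ratio it prints is not meant to reproduce the 3× figure above.

```python
import math


def quantize_group(values, bits=4):
    """Affine quantization of one group to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # constant group: any scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]


def effective_bits(bits=4, group_size=64, meta_bits=32):
    """Bits per stored value once per-group scale and offset are included."""
    return bits + meta_bits / group_size


# A toy slice of key/value entries: one group of 64 values.
group = [math.sin(0.1 * i) for i in range(64)]

codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))

bits_per_value = effective_bits()
print("code range:", min(codes), "to", max(codes))
print("max abs error:", max_err, "(half a step is", scale / 2, ")")
print("bits per value:", bits_per_value)
print("compression vs 16-bit:", 16 / bits_per_value)
```

Round-to-nearest keeps every value within half a quantization step of the original, and the per-group metadata is what keeps the effective width slightly above 4 bits.
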
+ ### 🟢 Tier 3: strategic (compounding)
+ 9. **Agent-RLVR** (R7): execution-reward RL with guidance → +13–18 pts SWE-bench. Upgrade #18.
+ 10. **Double down on the SCAFFOLD** (R7): the harness drives 11–50 pts; our verify-everything is the right bet, so invest there over chasing raw model size.
+ 11. **Compression order P→KD→Q** (R10): for any future demolition, heal the pruned model *before* quantizing.
+
+ ## What this changes immediately
+ - The **#59 heal** (next GPU job) should be **QAD (KL-distill), not plain SFT**, with anti-repetition data (Tier-1 #2).
+ - Before the heal, **prototype Hadamard rotation** (Tier-1 #1); it may fix the collapse more cheaply than the heal.
+ - **Build #60 KL-eval first**, so we measure, not guess (the lesson of the NVFP4 mistake).
+
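A KL-eval of the kind described above could start from something like the following. It is a sketch under assumptions: logits are plain Python lists, one per token position, and the metric is the mean per-token KL(teacher ‖ student) over the full vocabulary; the direction of the KL and the function names are choices made for this example, not details taken from #60. The same quantity, minimized during training, is the core of a KL-distillation heal.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position, in nats."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(lp, lq))


def mean_kl(teacher_seq, student_seq):
    """Average per-token KL over a sequence of positions."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)


# Toy logits: 2 token positions, vocabulary of 3.
teacher = [[2.0, 1.0, 0.0], [0.0, 0.5, 3.0]]
close = [[1.9, 1.1, 0.0], [0.1, 0.5, 2.9]]  # mild quantization drift
far = [[0.0, 1.0, 2.0], [3.0, 0.5, 0.0]]    # preferences flipped

kl_same = mean_kl(teacher, teacher)
kl_close = mean_kl(teacher, close)
kl_far = mean_kl(teacher, far)

print("KL(teacher || teacher):", kl_same)
print("KL(teacher || close):  ", kl_close)
print("KL(teacher || far):    ", kl_far)
```

Running the reference model and each quantized variant on the same prompts and comparing their `mean_kl` values gives a single number per variant that drops toward zero as the quantized model's output distribution approaches the reference.
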
+ *Sources (per-round, inline above): Unsloth Dynamic-v2, QTIP (2406.11235), SpinQuant (2405.16406), PolarQuant (2603.29078), QAD (2601.20088), SPEQ, Recover-LoRA (2606.04238), PonyExl3, SS-MoE (3792218), Agent-RLVR (2506.11425), scaffold-taxonomy (2604.03515), DeepSeek-V4 DSA, int4-KV (2605.05699), XGrammar-2 (2601.04426), REAP (2510.13999), compression-order (2603.18426).*