Upload research/quant_heal_speed_sota_10rounds.md with huggingface_hub
research/quant_heal_speed_sota_10rounds.md
ADDED
@@ -0,0 +1,43 @@
# 10-Round Deep Research: how to make the demolished GLM-5.2 better (June 2026)

Goal: "constantly research how to make this better." 10 rounds, ~40 searches, every claim sourced.
Context: v5c (saliency affine, 6-bit heads) measured at 99 GB / **5.3 tok/s** (1.47× v4) / coherent, which makes it the
working base. The shipped model stays q3a4-v4+soul2 until a heal validates v5c.

## The 10 rounds (one line each)
1. **Bit-allocation**: KL-guided (SliM-LLM) is optimal; use an **imatrix** (⟨a²⟩ activation importance), allocate **per-tensor** (ffn_down protected, ffn_up/gate aggressive), keep **router + shared experts high-precision**.
2. **Heal/recovery**: **QAD** (KL-distill teacher→student) > SFT; **SPEQ** makes the student its own teacher (no FP teacher needed); **Recover-LoRA** = 80–95 % 2-bit recovery from 10k synthetic; rank 64–128, lr 3e-4, α=2r; anti-repetition data fixes degeneration.
3. **Speed**: NVFP4 is **0 speedup in MLX** (Ollama-only); the real levers are a custom kernel or trellis serving.
4. **Frontier quant**: **QTIP (trellis + incoherence)** is SOTA but CUDA-only; **rotation (SpinQuant > QuaRot)** suppresses outliers.
5. **Rotation/incoherence**: Hadamard rotation **before** quant → outlier-free → fixes 3-bit collapse at the quant level; **feasible on MLX** (PolarQuant); makes the forward *faster*. **Biggest quant lever.**
6. **Serving**: **PonyExl3 (EXL3 trellis on MLX)** measured **152 tok/s on a 27B / M5 Max**, MoE+LoRA+spec support; **Self-Speculative MoE = 3.72×** (our "spec dead" was naive-spec only).
7. **Agentic quality**: **the SCAFFOLD matters MORE than the model** (11–50 pt swings); **Agent-RLVR = +13–18 pts SWE-bench**; small model + elite scaffold beats big model.
8. **Context**: our **DSA *is* DeepSeek's** (1M ctx @ 9.62 GiB KV); **int4 KV-cache is FREE on Apple Silicon** (outruns fp16); hybrid retrieval+long-ctx (our CallSieve+DSA).
9. **Reliability**: our stack (100 % constrained tool-JSON, verified decode, verifier mesh) **IS the validated SOTA**; upgrades = SCoRe self-correction RL, GenPRM, multi-verifier.
10. **Compression**: REAP (ours) is SOTA (+ Mar-2026 renorm fix); order **P→KD→Q** is best; Unsloth Dynamic 2.0 beats imatrix+QAT on KL.

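The allocation idea from round 1 can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the pipeline used for this model: the tensor sizes, the calibration activations, the candidate bit widths, the bit budget, and the greedy upgrade rule are all hypothetical. Only the ⟨a²⟩ importance measure and the "protect some tensors, squeeze the rest" pattern come from the notes above.

```python
from statistics import fmean


def importance(activations):
    """imatrix-style importance: mean squared activation <a^2> over samples."""
    return fmean([a * a for a in activations])


def allocate_bits(tensors, budget_bits, choices=(2, 3, 4, 6), protected=()):
    """Greedy per-tensor bit allocation (hypothetical rule, for illustration).

    tensors:     {name: (num_params, importance)}
    budget_bits: total bits available for all listed weights
    protected:   names pinned at the highest width (e.g. router, shared experts)
    """
    lo, hi = min(choices), max(choices)
    # Start everything at the lowest width, except the protected tensors.
    bits = {name: (hi if name in protected else lo) for name in tensors}
    used = sum(tensors[name][0] * b for name, b in bits.items())
    if used > budget_bits:
        raise ValueError("budget too small for the minimum allocation")

    # Visit unprotected tensors from most to least important and give each
    # the widest format that still fits in the remaining budget.
    order = sorted(
        (name for name in tensors if name not in protected),
        key=lambda name: tensors[name][1],
        reverse=True,
    )
    for name in order:
        size = tensors[name][0]
        for b in sorted(choices, reverse=True):
            extra = size * (b - bits[name])
            if extra > 0 and used + extra <= budget_bits:
                used += extra
                bits[name] = b
                break
    return bits


# Toy model: sizes and activations are made up for the example.
tensors = {
    "router": (1_000, 0.0),
    "ffn_down": (10_000, importance([3.0, -4.0])),
    "ffn_up": (10_000, importance([0.5, 0.5])),
    "ffn_gate": (10_000, importance([0.1, 0.3])),
}
# Budget of 3 bits per weight on average (31,000 weights in total).
plan = allocate_bits(tensors, budget_bits=93_000, protected=("router",))
print(plan)
```

With these toy numbers the router stays at 6 bits, `ffn_down` is upgraded to 4 bits, and `ffn_up`/`ffn_gate` remain at 2 bits, which mirrors the allocation pattern described in round 1. A KL-guided allocator would replace the importance score with a measured divergence per tensor.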
## PRIORITIZED ACTION PLAN (impact × feasibility)

### 🔴 Tier 1: highest impact, do first
1. **Hadamard incoherence/rotation before quant** (R5): outlier-free weights → fixes the 3-bit collapse at the *quant* level + faster forward. Prototype on MLX (PolarQuant-style). *The single biggest quality lever found.*
2. **QAD heal** (R2): KL-distill from the **mxfp4 teacher** (or SPEQ self-teacher) into v5c; recovers the 2-bit loss (80–95 %) and the CKA term fights collapse. Replaces plain-SFT heal. *The real quality lever (the collapse fix is the heal, not the quant; measured).*
3. **imatrix + per-tensor saliency** (R1): replace depth-U with ⟨a²⟩ importance; protect `ffn_down` + router + shared experts; aggressive 2-bit on `ffn_up/gate`. Better allocation at the same size.
4. **KL-eval (#60)** (R1/R10): the gold metric to *validate* every change (Unsloth's bar). Build + run on v4/v5b/v5c. *Without this we're guessing.*

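To make the rotation item concrete, here is a toy sketch of why rotating before quantization can help. It is an illustration under assumptions, not the PolarQuant or SpinQuant procedure: the 8-element weight vector, the 3-bit symmetric absmax quantizer, and the plain (non-randomized) Hadamard transform are all chosen for the example. Methods in this family typically also apply random sign flips, which are omitted here.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]


def quantize(x, bits):
    """Symmetric round-to-nearest with a single absmax scale for the vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Toy weights: one large outlier forces a coarse scale on everything else.
w = [12.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

direct = quantize(w, bits=3)                 # quantize as-is
rotated = fwht(quantize(fwht(w), bits=3))    # rotate, quantize, rotate back

mse_direct = mse(w, direct)
mse_rotated = mse(w, rotated)
print(f"direct: {mse_direct:.3f}  rotated: {mse_rotated:.3f}")
```

In the direct case the outlier sets the scale, so every small weight rounds to zero. After rotation the outlier's energy is spread across all coordinates, the scale shrinks, and the reconstruction error on this vector drops. Real weight matrices behave less cleanly, which is why the notes call for a prototype and a KL-eval before trusting the gain.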
### 🟠 Tier 2: big wins, more effort
5. **EXL3/PonyExl3 trellis serving** (R6): better quant (trellis) + fused in-kernel decode, measured fast on M5. Evaluate converting our model.
6. **int4 KV-cache quant** (R8): free on Apple Silicon, 3× KV compression. Wire into the serve (supersedes #86 SSD-offload).
7. **Self-Speculative MoE / layer-skip** (R6): 3.72× decode, no draft model. Re-open #69 with the MoE-specific method.
8. **Long-context via DSA + YaRN** (R8): our DSA already does 1M cheaply; extend RoPE (400–600 steps). Turns our weakest axis into a strength.

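The KV-cache item can be illustrated with a small group-wise quantizer and the storage arithmetic behind the compression figure. This is a sketch under assumptions rather than the serving code: the min/max affine scheme, the group sizes, and the 32 bits of per-group metadata (an fp16 scale plus an fp16 offset) are hypothetical choices made for the example.

```python
def quantize_group(values, bits=4):
    """Affine (min/max) quantization of one group -> (codes, scale, offset)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate values from integer codes."""
    return [c * scale + offset for c in codes]


def bits_per_value(bits, group_size, meta_bits=32):
    """Stored bits per element, including per-group scale and offset."""
    return bits + meta_bits / group_size


def compression_vs_fp16(bits, group_size, meta_bits=32):
    """How many times smaller the cache is than an fp16 cache."""
    return 16 / bits_per_value(bits, group_size, meta_bits)


# Toy slice of a key/value vector (values are made up).
vals = [-1.0, -0.5, 0.0, 0.25, 0.5, 2.0]
codes, scale, offset = quantize_group(vals, bits=4)
approx = dequantize_group(codes, scale, offset)

for group_size in (32, 64):
    print(group_size, round(compression_vs_fp16(4, group_size), 2))
```

Under these assumptions the ratio works out to 3.2× at a group size of 32 and about 3.56× at 64, the same ballpark as the 3× quoted above; the exact figure depends on the group size and metadata format the serving stack actually uses.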
### 🟢 Tier 3: strategic (compounding)
9. **Agent-RLVR** (R7): execution-reward RL with guidance → +13–18 pts SWE-bench. Upgrade #18.
10. **Double down on the SCAFFOLD** (R7): the harness drives 11–50 pts; our verify-everything is the right bet, so invest there over chasing raw model size.
11. **Compression order P→KD→Q** (R10): for any future demolition, heal the pruned model *before* quantizing.

## What this changes immediately
- The **#59 heal** (next GPU job) should be **QAD (KL-distill), not plain SFT**, with anti-repetition data (Tier-1 #2).
- Before the heal, **prototype Hadamard rotation** (Tier-1 #1): it may fix the collapse more cheaply than the heal.
- **Build #60 KL-eval first**, so we measure, not guess (the lesson of the nvfp4 mistake).

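The quantity behind both the KL-eval and the QAD heal is a per-token KL divergence between a reference model's next-token distribution and the quantized model's. A minimal sketch, assuming the direction KL(reference ‖ quantized), natural-log units, and logits already aligned position by position; how the logits are obtained from either model is not shown, and the function names are hypothetical:

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kl(ref_batch, test_batch):
    """Average per-token KL over aligned positions."""
    if len(ref_batch) != len(test_batch):
        raise ValueError("reference and test must cover the same positions")
    total = sum(kl_divergence(r, t) for r, t in zip(ref_batch, test_batch))
    return total / len(ref_batch)


# Toy logits over a 3-token vocabulary (values are made up).
reference = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
quantized = [[1.8, 0.7, -1.2], [0.3, -0.1, 0.0]]
print(f"mean KL: {mean_kl(reference, quantized):.5f} nats")
```

Used as an evaluation, a lower mean KL means the quantized model tracks the reference more closely; used as a training loss, the same expression is what a KL-distillation heal would minimize.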
*Sources (per-round, inline above): Unsloth Dynamic-v2, QTIP (2406.11235), SpinQuant (2405.16406), PolarQuant (2603.29078), QAD (2601.20088), SPEQ, Recover-LoRA (2606.04238), PonyExl3, SS-MoE (3792218), Agent-RLVR (2506.11425), scaffold-taxonomy (2604.03515), DeepSeek-V4 DSA, int4-KV (2605.05699), XGrammar-2 (2601.04426), REAP (2510.13999), compression-order (2603.18426).*