Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Notebooks
Google Colab
Kaggle
Local Apps Settings
LM Studio

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

philipjohnbasile commited on 13 days ago

Commit

d72ec2e

verified ·

1 Parent(s): 748383c

Upload research/quant_heal_speed_sota_10rounds.md with huggingface_hub

Browse files

Files changed (1) hide show

research/quant_heal_speed_sota_10rounds.md +43 -0

research/quant_heal_speed_sota_10rounds.md ADDED Viewed

	@@ -0,0 +1,43 @@

+# 10-Round Deep Research: how to make the demolished GLM-5.2 better (June 2026)
+Goal: "constantly research how to make this better." 10 rounds, ~40 searches, every claim sourced.
+Context: v5c (saliency affine, 6-bit heads) measured at 99 GB / **5.3 tok/s** (1.47× v4) / coherent — the
+working base. The shipped model stays q3a4-v4+soul2 until a heal validates v5c.
+## The 10 rounds (one line each)
+1. **Bit-allocation** — KL-guided (SliM-LLM) is optimal; use an **imatrix** (⟨a²⟩ activation importance), allocate **per-tensor** (ffn_down protected, ffn_up/gate aggressive), keep **router + shared experts high-precision**.
+2. **Heal/recovery** — **QAD** (KL-distill teacher→student) > SFT; **SPEQ** makes the student its own teacher (no FP teacher needed); **Recover-LoRA** = 80–95 % 2-bit recovery from 10k synthetic; rank 64–128, lr 3e-4, α=2r; anti-repetition data fixes degeneration.
+3. **Speed** — NVFP4 is **0 speedup in mlx** (Ollama-only); the real levers are a custom kernel or trellis serving.
+4. **Frontier quant** — **QTIP (trellis + incoherence)** is SOTA but CUDA-only; **rotation (SpinQuant > QuaRot)** suppresses outliers.
+5. **Rotation/incoherence** — Hadamard rotation **before** quant → outlier-free → fixes 3-bit collapse at the quant level; **feasible on MLX** (PolarQuant); makes the forward *faster*. **Biggest quant lever.**
+6. **Serving** — **PonyExl3 (EXL3 trellis on MLX)** measured **152 tok/s on a 27B / M5 Max**, MoE+LoRA+spec support; **Self-Speculative MoE = 3.72×** (our "spec dead" was naive-spec only).
+7. **Agentic quality** — **the SCAFFOLD matters MORE than the model** (11–50 pt swings); **Agent-RLVR = +13–18 pts SWE-bench**; small model + elite scaffold beats big model.
+8. **Context** — our **DSA *is* DeepSeek's** (1M ctx @ 9.62 GiB KV); **int4 KV-cache is FREE on Apple Silicon** (outruns fp16); hybrid retrieval+long-ctx (our CallSieve+DSA).
+9. **Reliability** — our stack (100 % constrained tool-JSON, verified decode, verifier mesh) **IS the validated SOTA**; upgrades = SCoRe self-correction RL, GenPRM, multi-verifier.
+10. **Compression** — REAP (ours) is SOTA (+ Mar-2026 renorm fix); order **P→KD→Q** is best; Unsloth Dynamic 2.0 beats imatrix+QAT on KL.
+## PRIORITIZED ACTION PLAN (impact × feasibility)
+### 🔴 Tier 1 — highest impact, do first
+1. **Hadamard incoherence/rotation before quant** (R5) — outlier-free weights → fixes the 3-bit collapse at the *quant* level + faster forward. Prototype on MLX (PolarQuant-style). *The single biggest quality lever found.*
+2. **QAD heal** (R2) — KL-distill from the **mxfp4 teacher** (or SPEQ self-teacher) into v5c; recovers the 2-bit loss (80–95 %) and the CKA term fights collapse. Replaces plain-SFT heal. *The real quality lever (the collapse fix is the heal, not the quant — measured).*
+3. **imatrix + per-tensor saliency** (R1) — replace depth-U with ⟨a²⟩ importance; protect `ffn_down` + router + shared experts; aggressive 2-bit on `ffn_up/gate`. Better allocation at the same size.
+4. **KL-eval (#60)** (R1/R10) — the gold metric to *validate* every change (Unsloth's bar). Build + run on v4/v5b/v5c. *Without this we're guessing.*
+### 🟠 Tier 2 — big wins, more effort
+5. **EXL3/PonyExl3 trellis serving** (R6) — better quant (trellis) + fused in-kernel decode, measured fast on M5. Evaluate converting our model.
+6. **int4 KV-cache quant** (R8) — free on Apple Silicon, 3× KV compression. Wire into the serve (supersedes #86 SSD-offload).
+7. **Self-Speculative MoE / layer-skip** (R6) — 3.72× decode, no draft model. Re-open #69 with the MoE-specific method.
+8. **Long-context via DSA + YaRN** (R8) — our DSA already does 1M cheaply; extend RoPE (400–600 steps). Turns our weakest axis into a strength.
+### 🟢 Tier 3 — strategic (compounding)
+9. **Agent-RLVR** (R7) — execution-reward RL with guidance → +13–18 pts SWE-bench. Upgrade #18.
+10. **Double down on the SCAFFOLD** (R7) — the harness drives 11–50 pts; our verify-everything is the right bet — invest there over chasing raw model size.
+11. **Compression order P→KD→Q** (R10) — for any future demolition, heal the pruned model *before* quantizing.
+## What this changes immediately
+- The **#59 heal** (next GPU job) should be **QAD (KL-distill), not plain SFT**, with anti-repetition data — Tier-1 #2.
+- Before the heal, **prototype Hadamard rotation** (Tier-1 #1) — it may fix the collapse cheaper than the heal.
+- **Build #60 KL-eval first** — so we measure, not guess (the lesson of the nvfp4 mistake).
+*Sources: per-round inline above — Unsloth Dynamic-v2, QTIP (2406.11235), SpinQuant (2405.16406), PolarQuant (2603.29078), QAD (2601.20088), SPEQ, Recover-LoRA (2606.04238), PonyExl3, SS-MoE (3792218), Agent-RLVR (2506.11425), scaffold-taxonomy (2604.03515), DeepSeek-V4 DSA, int4-KV (2605.05699), XGrammar-2 (2601.04426), REAP (2510.13999), compression-order (2603.18426).*