Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
```
Run Hermes
hermes
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 by default)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,83 +1,31 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model: zai-org/GLM-5.2
|
| 4 |
-
base_model_relation: quantized
|
| 5 |
library_name: mlx
|
| 6 |
pipeline_tag: text-generation
|
| 7 |
language: [en]
|
| 8 |
-
tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent
|
| 9 |
-
datasets:
|
| 10 |
-
- open-r1/Mixture-of-Thoughts
|
| 11 |
-
- open-r1/OpenR1-Math-220k
|
| 12 |
-
- open-thoughts/OpenThoughts-114k
|
| 13 |
-
- HuggingFaceH4/ultrachat_200k
|
| 14 |
-
- theblackcat102/evol-codealpaca-v1
|
| 15 |
-
- Salesforce/xlam-function-calling-60k
|
| 16 |
-
- glaiveai/glaive-function-calling-v2
|
| 17 |
-
- SWE-bench/SWE-smith-trajectories
|
| 18 |
-
- internlm/Lean-Workbook
|
| 19 |
---
|
| 20 |
|
| 21 |
-
# GLM-5.2-Demolition — a 743B frontier MoE on a 128 GB Mac
|
| 22 |
-
|
| 23 |
-

|
| 24 |
|
| 25 |
**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
|
| 26 |
-
demolished it to **99
|
| 27 |
-
healed it
|
| 28 |
-
|
| 29 |
-
**
|
| 30 |
-
|
| 31 |
-
It is no longer one niche model. It is **one 99 GB base + a deepening, masters-trained "soul" + a growing
|
| 32 |
-
library of swappable code specialties** — a small **model factory** that runs on a single Mac.
|
| 33 |
-
|
| 34 |
-
## The shape: one base, a soul, swappable code
|
| 35 |
-
|
| 36 |
-
```
|
| 37 |
-
CORE SOUL (always-on, baked toward the masters) SWAPPABLE CODE MODULE (the dial)
|
| 38 |
-
├─ design · dataviz · prose · math ├─ fullstack / AI-eng / DS-ML (default)
|
| 39 |
-
├─ research · architecture × ├─ game / app dev (Unreal·Unity·Godot·Flutter)
|
| 40 |
-
├─ SECURITY — purple-team: crypto · web · net · └─ legacy / enterprise (COBOL·Java·PHP — old AND modern)
|
| 41 |
-
│ secure-code · blue-team + red-team/pentest/CTF
|
| 42 |
-
├─ science (physics · chem · bio)
|
| 43 |
-
└─ perfumery
|
| 44 |
-
```
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
trained on gold spidered from the people who *defined* each field (Rams/Müller-Brockmann for design,
|
| 49 |
-
Kernighan/Knuth for code, Erdős/Pólya for math, Tufte for dataviz, Saltzer-Schroeder for security,
|
| 50 |
-
Strunk/Orwell for prose, Parnas/Uncle-Bob for architecture, Feynman/Popper for research).
|
| 51 |
-
- The **code module** is a swappable ~500 MB adapter: a game dev and an AI engineer load the *same* elite
|
| 52 |
-
design/prose/math/security — only the coding expertise changes. New market = one small adapter, not a new base.
|
| 53 |
-
|
| 54 |
-
## The soul, and how it's built
|
| 55 |
-
The demolished base reverts to the *average* of its training. To make it **elite**, we don't ask it to
|
| 56 |
-
imitate itself (that degenerates) — we **research the masters** and heal toward them:
|
| 57 |
-
|
| 58 |
-
> **spider the elite canon of a field → generate audit-gated, secure-by-default gold → heal a LoRA → scorecard**
|
| 59 |
-
|
| 60 |
-
The current core soul (`adapters-soul2`) is **250 masters-grounded examples across 8 facets** — every one
|
| 61 |
-
`json.dumps`-clean, gated by a per-facet eliteness audit (with a degeneration guard), and **secure-by-default**
|
| 62 |
-
(parameterized queries, AEAD crypto, no hardcoded secrets, validated input). The heal **preserved code**
|
| 63 |
-
(HumanEval held at 116/164 = 70.7%, identical to the prior soul) while adding the full facet breadth.
|
| 64 |
-
|
| 65 |
-
**Design ranges from restraint to maximalism** — Swiss minimalism (Rams · Müller-Brockmann · Vignelli) *and*
|
| 66 |
-
pop-street (Warhol · Banksy · Mr-Brainwash · Murakami), plus Bauhaus, editorial, product-systems, and
|
| 67 |
-
experimental/brutalist movements. **Security is full purple-team** — defensive core (crypto/web/net/secure-coding/
|
| 68 |
-
blue-team) **and** authorized red-team/pentest/CTF (every offensive technique paired with its detection +
|
| 69 |
-
hardening). **Math** spans Furstenberg → Ramsey → Zagier with Lean-4 proofs. Everything uses **current versions**
|
| 70 |
-
(React 19 · PyTorch 2.x · OWASP 2025 · CVE-2025 · Java 21 · PHP 8.4) — *except* the legacy module, which is
|
| 71 |
-
intentionally old (and also carries the modern target: COBOL-on-Kubernetes, Spring Boot 3, .NET 8).
|
| 72 |
|
| 73 |
## How it was made
|
| 74 |
1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight ×
|
| 75 |
activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set — it never fits in RAM).
|
| 76 |
2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
|
| 77 |
-
3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
|
| 82 |
## What makes it different (built + selftested)
|
| 83 |
- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
|
|
@@ -85,158 +33,84 @@ intentionally old (and also carries the modern target: COBOL-on-Kubernetes, Spri
|
|
| 85 |
Practical *only* on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
|
| 86 |
- **The verifier mesh:** every output meets its real tool — compile+run+**idiomatic lint** (clippy/ruff/
|
| 87 |
gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
|
| 88 |
-
- **A
|
| 89 |
-
secret-scan, prompt-injection guard, audit, risk-gate),
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
|
| 94 |
live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
|
| 95 |
|
| 96 |
-
##
|
| 97 |
-
The
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
- **
|
| 104 |
-
|
| 105 |
-
- **
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
##
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
backtracked **as it's written** (TS 0.3 ms · Python ~0 ms · Rust 34 ms per check).
|
| 114 |
-
- **Fabrication-proof `done`** — the agent **re-runs the original tests** before claiming success; it can't hallucinate a pass.
|
| 115 |
-
- **Integrity layer** — test-tamper guard, **16-provider secret-scan**, scope enforcement, slopsquat guard.
|
| 116 |
-
- **The verifier mesh** — every output meets its real tool: compile+run+idiomatic-lint (clippy/ruff/gofmt/prettier) for
|
| 117 |
-
**5 langs**, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), and a **design render-critic** (render+see).
|
| 118 |
-
|
| 119 |
-
### Multi-tier M5 Max hardware use (every tier earns its keep, in parallel)
|
| 120 |
-
- **GPU** — decode + NVFP4 + image-gen.
|
| 121 |
-
- **CPU (18 cores)** — runs the whole verify-everything stack in parallel: `verify_many` fans the verifier mesh across
|
| 122 |
-
all 18 cores (**measured 6.6×**) and feeds proof-search.
|
| 123 |
-
- **ANE (16-core Neural Engine)** — embeddings via Apple `NLContextualEmbedding` (`src/ane_embed.py`, `backend=ane`,
|
| 124 |
-
no coremltools): **768-dim, ~9.5 ms**, verified.
|
| 125 |
-
- **SSD** — warm-start **prompt-cache persistence** (`save()/load()` + keyed warm-start; round-trip selftest PASS).
|
| 126 |
-
|
| 127 |
-
### Breadth — the 10-facet soul
|
| 128 |
-
- **One always-on soul** makes the model *elite*, not just correct, across **design · dataviz · prose · math ·
|
| 129 |
-
research · architecture · security (purple-team) · science · perfumery** — trained on master-grounded gold,
|
| 130 |
-
per-facet eliteness-audited, secure-by-default, with code preserved (HumanEval held at **116/164 = 70.7%**).
|
| 131 |
-
- **Formal-math Lean prover** (`66_prove`) — local Lean-4 prover lane: **miniF2F-test 32/226 = 14.2% pass@4**,
|
| 132 |
-
**Lean-verified**, contamination-checked.
|
| 133 |
-
|
| 134 |
-
### Multimodal stack (all MLX)
|
| 135 |
-
- **Vision** (Qwen3-VL-4B-8bit), **image-gen**, **video**, and **structured tools** — plus code_intel across 5 langs.
|
| 136 |
-
|
| 137 |
-
### The model factory
|
| 138 |
-
- **Swappable domain adapters** on one base (download once): each capability is a ~500 MB LoRA. **Pattern A =
|
| 139 |
-
base + module gold** (a game dev and an AI engineer share the *same* elite soul; only the code module swaps).
|
| 140 |
-
- **Shipped souls:** **soul2 ✓** and **soul-v3 ✓** (on HF); `heal_queue.sh` driver is autonomous.
|
| 141 |
-
- **In the heal queue:** `fullstack` (healing now) → `gamedev` → `legacy` → FACTORY_DONE.
|
| 142 |
|
| 143 |
## Requirements
|
| 144 |
- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
|
| 145 |
-
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
~110 GB by default, so it OOM-crashes on long generations. Fix before serving:
|
| 151 |
```bash
|
| 152 |
sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
|
| 153 |
sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
|
| 154 |
```
|
| 155 |
|
| 156 |
## Use it
|
| 157 |
```bash
|
| 158 |
python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
|
| 159 |
-
# serve the base + the soul (the swappable adapter is how you pick the specialty):
|
| 160 |
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
|
| 161 |
-
--adapter-path adapters-
|
| 162 |
-
#
|
| 163 |
-
curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
|
| 164 |
-
-d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'
|
| 165 |
-
# drive the 51-tool agent on your repo:
|
| 166 |
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
|
| 167 |
```
|
| 168 |
-
In **LM Studio**: run the patch, fully quit + reopen, then load the model.
|
| 169 |
-
memory-capped at ~11–14 tok/s — ALL speculative methods measured-DEAD on this MoE (see `SPEED.md`); throughput = batching.
|
| 170 |
|
| 171 |
-
## Performance (M5 Max 128 GB)
|
| 172 |
| Metric | Value |
|
| 173 |
|---|---|
|
| 174 |
-
| Size | 99
|
| 175 |
-
| HumanEval pass@1 | **
|
| 176 |
-
| Math GSM8K | **8/12 (67%)** —
|
| 177 |
-
| miniF2F-test (formal proof) | **32/226 (14.2%)** — pass@4, Lean-verified, contamination-checked; a general model, NOT a specialized prover |
|
| 178 |
| Algebra (SymPy-checked) | **3/4 (75%)** |
|
| 179 |
-
| Decode speed
|
| 180 |
-
|
|
| 181 |
-
|
| 182 |
-
**Speed in one line:** single-stream is memory-capped at **~11–14 tok/s** — every speculative method (MTP / EAGLE /
|
| 183 |
-
prompt-lookup / dsa-block-size) was *measured* and is dead on this memory-bound MoE. The real win is **batching: a
|
| 184 |
-
measured 2.6× throughput**, which `mlx_lm.server` delivers **natively** on concurrent requests. Receipts: [`SPEED.md`](SPEED.md).
|
| 185 |
-
|
| 186 |
-
**Benchmark honesty:** HumanEval is the **full 164** (116/164 = 70.7%, single-shot); GSM8K (**n=12**) is a **small
|
| 187 |
-
held-out subset**; miniF2F **is** the full 226. Every number is **contamination-checked** (0% / 0% / 0.4% near-dup) —
|
| 188 |
-
**reasoned, not memorized**. Honest frontier-vs-us comparison + projections: [`BENCHMARKS.md`](BENCHMARKS.md).
|
| 189 |
-
|
| 190 |
-
## The factory — swappable souls & code, one base
|
| 191 |
-
The spider→gold→heal recipe is **domain-agnostic**: "make a model elite at X" is now a repeatable procedure.
|
| 192 |
-
On the one 99 GB base (downloaded once), each new capability is a ~500 MB adapter:
|
| 193 |
-
- **Core soul** — design · dataviz · prose · math · research · architecture · security (purple-team) · science · perfumery.
|
| 194 |
-
- **Code modules (swap one)** — `fullstack/AI-eng/DS-ML` (RAG, agents, MLOps, deep-learning, data-eng, web/devops) ·
|
| 195 |
-
`game/app` (Unreal C++/Blueprints, Unity C#, Godot GDScript, Flutter/Dart, Nystrom patterns, shaders, netcode) ·
|
| 196 |
-
`legacy` (COBOL/mainframe, enterprise Java, PHP — classic **and** modernized to Java 21 / PHP 8.4 / .NET 8 / COBOL-on-K8s).
|
| 197 |
-
- Verified by design: each code module's gold targets a **compile-verification** pass (the leap Lean gave miniF2F).
|
| 198 |
-
|
| 199 |
-
**Swap a module + build a new specialty:** full mechanics — runtime swap · the two soul-merge patterns ·
|
| 200 |
-
the spider→gold→heal recipe · the rules — are in [`FACTORY.md`](FACTORY.md). The model is also being taught
|
| 201 |
-
its *own* factory (route a task → the right module, emitting a `<module>…</module>` signal), so it can self-select the specialty.
|
| 202 |
-
|
| 203 |
-
## Roadmap — what's queued next
|
| 204 |
-
Honest queue (the live kanban is `BACKLOG.md`). These are **not built yet** — the Features section above is:
|
| 205 |
-
- **ANE vision (#79)** — move the vision encoder onto the Neural Engine (the big ANE win; convert-friendly model).
|
| 206 |
-
- **ANE speech (#87)** — Whisper / `SFSpeechRecognizer` on the ANE — a voice lane.
|
| 207 |
-
- **SSD-backed long-context KV (#86)** — KV offload to the 14.5 GB/s SSD for long context (attacks our weakest axis
|
| 208 |
-
vs 1M-ctx rivals; the #85 prompt-cache plumbing is already done).
|
| 209 |
-
- **Metal-4 TensorOps fused-MoE kernel (#81)** — custom fused kernel, the new M5 decode lever (**~30–60% decode**,
|
| 210 |
-
`research/mlx_speed_deepdive.md`); distinct from the (dead) speculative methods.
|
| 211 |
-
- **#59 NVFP4 collapse-fix** — saliency-dynamic quant (early/late experts at 4-bit) to cure long-gen Computation Collapse;
|
| 212 |
-
tooling is wired, GPU-gated behind the factory.
|
| 213 |
-
- **Agentic-gold heal (#84)** — heal the 23 staged agentic-gold examples into the soul.
|
| 214 |
-
|
| 215 |
-
## Roadmap — the Demolition family (shrink, keep the soul)
|
| 216 |
-
Same masters-trained soul, every Mac — the elite training lives in the size-agnostic calibration + heal corpus:
|
| 217 |
-
```
|
| 218 |
-
~106GB : ████████ 77 experts · 3-bit (this model) → 128 GB Mac
|
| 219 |
-
67GB : ██████ 46 experts · 3-bit → 96 GB Mac
|
| 220 |
-
55GB : █████ 36 experts · 3-bit → 64 GB Mac
|
| 221 |
-
36GB : ███ 26 experts · 2.5-bit → 48 GB Mac
|
| 222 |
-
20GB : ██ 16 experts · 2-bit ⚗️ → 32 GB Mac
|
| 223 |
-
14GB : █ 8 experts · 2-bit ⚗️ (the floor) → 24 GB Mac
|
| 224 |
-
```
|
| 225 |
-
Sizes **measured** from the build: **~10.4 GB fixed base** + experts × ~1.24 GB × bits/3. The base dominates,
|
| 226 |
-
so **below ~13 GB is impossible** — the right column is your **minimum Mac RAM**.
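
As a rough check on the size rule above, the sketch below evaluates `~10.4 GB + experts × ~1.24 GB × bits/3` for a given configuration. The function name `estimate_size_gb` and the constant names are illustrative and not part of the repo; the constants are the approximate figures quoted in the text, so the results are estimates only.

```python
FIXED_BASE_GB = 10.4        # approximate fixed (non-expert) weights, as quoted in the text
GB_PER_EXPERT_3BIT = 1.24   # approximate size of one expert at 3-bit, as quoted in the text


def estimate_size_gb(experts: int, bits: float) -> float:
    """Estimate model size: fixed base plus experts scaled linearly by bit width."""
    if experts < 0 or bits <= 0:
        raise ValueError("experts must be >= 0 and bits must be > 0")
    return FIXED_BASE_GB + experts * GB_PER_EXPERT_3BIT * bits / 3.0


if __name__ == "__main__":
    # The 77-expert, 3-bit configuration described in this README.
    print(f"77 experts @ 3-bit: {estimate_size_gb(77, 3):.1f} GB")
```

For the 77-expert, 3-bit row this evaluates to about 105.9 GB, in line with the ~106 GB shown in the chart. With zero experts the estimate is the fixed base alone, which is why the size cannot shrink much further.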
|
| 227 |
|
| 228 |
## Honest limitations
|
| 229 |
-
- **Specialist
|
| 230 |
-
- **Speed ~11
|
| 231 |
-
(
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
- **
|
| 238 |
|
| 239 |
## Attribution & license
|
| 240 |
**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
|
| 241 |
-
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing /
|
| 242 |
tooling is this repo's contribution.
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model: zai-org/GLM-5.2
|
| 4 |
library_name: mlx
|
| 5 |
pipeline_tag: text-generation
|
| 6 |
language: [en]
|
| 7 |
+
tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
|
| 8 |
---
|
| 9 |
|
| 10 |
+
# GLM-5.2-Demolition — a 743B frontier MoE, demolished to run on a 128 GB Mac
|
| 11 |
|
| 12 |
**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
|
| 13 |
+
demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)** — then
|
| 14 |
+
healed it and wrapped it in a **47-tool local agent** that does things a cloud model structurally
|
| 15 |
+
cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
|
| 16 |
+
secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.
|
| 17 |
|
| 18 |
+
A **niche specialist**, not a general model — tuned to beat a frontier model *in one lane* (agentic
|
| 19 |
+
coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.
|
| 20 |
|
| 21 |
## How it was made
|
| 22 |
1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight ×
|
| 23 |
activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set — it never fits in RAM).
|
| 24 |
2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
|
| 25 |
+
3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
|
| 26 |
+
**code-first balanced calibration** (so the *math* super-experts survive the prune — v3's coding-only
|
| 27 |
+
calibration collapsed math) + heal/distill on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
|
| 28 |
+
Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed → SFT.)*
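
The pruning criterion in step 1 can be sketched in a few lines. This is a minimal illustration of `router_weight × activation_norm` with padding masked out, not the repo's streaming implementation. The function names, the input layout, and the choice to average the score over the tokens routed to each expert are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def reap_saliency(
    router_weights: Sequence[Sequence[float]],
    expert_outputs: Sequence[Sequence[Sequence[float]]],
    is_padding: Sequence[bool],
) -> List[float]:
    """Score each expert by the mean of router_weight * ||expert_output||.

    router_weights[t][e] is the gate weight of expert e for token t
    (0.0 when the token was not routed to that expert).
    expert_outputs[t][e] is the output vector of expert e for token t.
    Padding tokens are skipped entirely.
    """
    n_experts = len(router_weights[0])
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for t, padded in enumerate(is_padding):
        if padded:
            continue  # padding-masked: padding tokens contribute nothing
        for e in range(n_experts):
            weight = router_weights[t][e]
            if weight == 0.0:
                continue  # token was not routed to this expert
            norm = math.sqrt(sum(x * x for x in expert_outputs[t][e]))
            totals[e] += weight * norm
            counts[e] += 1
    return [total / count if count else 0.0 for total, count in zip(totals, counts)]


def keep_top_experts(saliency: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-saliency experts, in index order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])
```

In the setting described above, `keep` would be 77 out of 256 experts per layer, with the scores accumulated one layer at a time so the full model never has to be resident in memory.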
|
| 29 |
|
| 30 |
## What makes it different (built + selftested)
|
| 31 |
- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
|
|
|
|
| 33 |
Practical *only* on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
|
| 34 |
- **The verifier mesh:** every output meets its real tool — compile+run+**idiomatic lint** (clippy/ruff/
|
| 35 |
gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
|
| 36 |
+
- **A 47-tool agent** with **five defense layers** the frontier lacks out of the box:
|
| 37 |
+
**trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
|
| 38 |
+
**reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
|
| 39 |
+
**self-improvement** (skill library, large-output pointers, clarify-before-assuming),
|
| 40 |
+
**integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
|
| 41 |
+
plus a **humanizer** (kills AI-slop, matches your voice).
|
| 42 |
+
- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
|
| 43 |
+
writes in your style — a cloud flagship can't be tuned on your private code.
|
| 44 |
- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
|
| 45 |
live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
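
The verified-decoding idea from the first bullet can be illustrated with a toy loop: propose candidate lines, keep the first one the checker accepts, and reject the rest. This is a sketch under stated assumptions, not the repo's decoder. Python's built-in `compile()` stands in for the real type-checker (it only checks syntax), and `propose` is a hypothetical callback that would be backed by the model.

```python
from typing import Callable, List, Sequence


def passes_check(source: str) -> bool:
    """Stand-in checker: True if `source` compiles as a Python module."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:  # also covers IndentationError
        return False


def verified_decode(
    propose: Callable[[Sequence[str]], Sequence[str]],
    max_lines: int,
) -> List[str]:
    """Build a program line by line, accepting only lines the checker allows.

    `propose(accepted)` returns candidate next lines in preference order;
    an empty result means generation is finished.
    """
    accepted: List[str] = []
    for _ in range(max_lines):
        for candidate in propose(accepted):
            program = "\n".join(accepted + [candidate]) + "\n"
            if passes_check(program):
                accepted.append(candidate)
                break
        else:
            break  # no candidate survived the checker (or none were proposed)
    return accepted
```

Each candidate here is a complete top-level statement, which keeps the stand-in checker simple. A real checker that understands partial blocks and types would let the same loop reject a line as soon as it breaks the program.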
|
| 46 |
|
| 47 |
+
## Every chip on the M5 Max, working
|
| 48 |
+
The agent spreads perception, verification, and dispatch across **all six compute blocks** so the GPU stays
|
| 49 |
+
free for token generation (built + selftested):
|
| 50 |
+
- **GPU** (40-core + M5 Neural Accelerators) — the 99 GB model decodes + LoRA-heals.
|
| 51 |
+
- **Neural Engine** (16-core) — embeddings · OCR · image segmentation / pose / object-detection · NER+POS ·
|
| 52 |
+
audio classification + VAD · neural TTS · zero-shot routing · rerank — all via Apple frameworks, no CoreML, no GPU.
|
| 53 |
+
- **18 CPU cores** — the verifier mesh fanned out (`verify_many`, 6.6×) · 9-language compile-verify · tabular ML.
|
| 54 |
+
- **Media Engine** — hardware H.264/HEVC/AV1 decode + encode for the video lane.
|
| 55 |
+
- **AMX/SME** — matrix coprocessor via Accelerate (~2.1 TFLOP/s f32), implicit in every numpy op.
|
| 56 |
+
- **ASR** = **Whisper on MLX** (no mic-permission needed). An **Any-to-Any omni-router** sends any input
|
| 57 |
+
(text / image / audio / video / table) to its optimal block.
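
The CPU fan-out in the list above can be sketched as a pool that runs one checker per artifact in parallel. This is an illustration only: `verify_all`, the `VERIFIERS` table, and the two in-process checkers are hypothetical stand-ins, not the repo's `verify_many`.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple


def python_compiles(source: str) -> bool:
    """Syntax-check Python source with the built-in compiler."""
    try:
        compile(source, "<artifact>", "exec")
        return True
    except SyntaxError:
        return False


def json_parses(source: str) -> bool:
    """Check that the text is valid JSON."""
    try:
        json.loads(source)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical registry mapping an artifact kind to its checker.
VERIFIERS: Dict[str, Callable[[str], bool]] = {
    "python": python_compiles,
    "json": json_parses,
}


def verify_all(artifacts: Sequence[Tuple[str, str]], max_workers: int = 18) -> List[bool]:
    """Run the matching checker for each (kind, source) pair; results keep input order."""

    def check(item: Tuple[str, str]) -> bool:
        kind, source = item
        return VERIFIERS[kind](source)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check, artifacts))
```

Threads are used here because real verifiers are usually external processes (compilers, linters), and waiting on a subprocess does not hold Python's interpreter lock. The in-process checkers above only show the shape of the fan-out and would not themselves scale across cores.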
|
| 58 |
+
|
| 59 |
+
## The model factory (swappable domain souls)
|
| 60 |
+
One 99 GB base + hot-swappable LoRA "souls" (~100 MB each) — change the model's specialty by swapping the
|
| 61 |
+
adapter: **code · design · agentic · gamedev · legacy/enterprise · security · fullstack · science · data ·
|
| 62 |
+
perfumery**. Each is healed from the same base by an autonomous chain that forges the whole library overnight
|
| 63 |
+
on the one Mac — and a `factory`-dispatcher soul makes the model route requests to the right specialty itself.
|
| 64 |
|
| 65 |
## Requirements
|
| 66 |
- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
|
| 67 |
+
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patches** — stock
|
| 68 |
+
mlx_lm can't load it.
|
| 69 |
+
- **⚠️ Raise the GPU memory ceiling — required.** The model needs ~101.6 GB; macOS caps the GPU
|
| 70 |
+
working set at ~110 GB by default, leaving little headroom as the context grows, so it OOM-crashes (Metal command-buffer timeout) on long
|
| 71 |
+
generations. Fix before serving:
|
| 72 |
```bash
|
| 73 |
sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
|
| 74 |
sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
|
| 75 |
```
|
| 76 |
+
Without this the model appears to "randomly crash" — it's just memory-starved.
|
| 77 |
|
| 78 |
## Use it
|
| 79 |
```bash
|
| 80 |
python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
|
| 81 |
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
|
| 82 |
+
--adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
|
| 83 |
+
# drive the 47-tool agent on your repo:
|
| 84 |
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
|
| 85 |
+
# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
|
| 86 |
```
|
| 87 |
+
In **LM Studio**: run the patch, fully quit + reopen, then load the model.
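
Once the server is running, it can be called from Python as well as from an agent. The sketch below uses only the standard library and assumes the server is listening on `localhost:8080`; the helper names `build_chat_payload` and `send_chat` are illustrative, and the `chat_template_kwargs` field mirrors the request shown in the previous revision of this README.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8080/v1"  # assumption: mlx_lm.server on its default port


def build_chat_payload(user_text: str, enable_thinking: bool = True) -> Dict[str, Any]:
    """Build an OpenAI-style chat request body for a single user message."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat(user_text: str, timeout_s: float = 600.0) -> str:
    """POST the request to the local server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A call such as `send_chat("Write a typed debounce in TypeScript.")` needs the server to be up. The generous timeout reflects the decode speed quoted in this README, where long thinking-ON answers can take minutes.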
|
| 88 |
|
| 89 |
+
## Performance (M5 Max 128 GB, v4)
|
| 90 |
| Metric | Value |
|
| 91 |
|---|---|
|
| 92 |
+
| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
|
| 93 |
+
| HumanEval pass@1 | **19/20 (95%)**, single-shot |
|
| 94 |
+
| Math GSM8K | **8/12 (67%)** — recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
|
| 95 |
| Algebra (SymPy-checked) | **3/4 (75%)** |
|
| 96 |
+
| Decode speed | **11.3 tok/s** (no draft) — see the speed note in limitations |
|
| 97 |
+
| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
|
| 98 |
|
| 99 |
## Honest limitations
|
| 100 |
+
- **Specialist:** ~70% of experts pruned — strong in the target niche, weaker outside it. Not the full 743B.
|
| 101 |
+
- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
|
| 102 |
+
**DSA attention kernels** (mlx #837 / #3402 — *improves for free* as MLX matures), partly the bandwidth
|
| 103 |
+
cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
|
| 104 |
+
for single-token decode (decode is bandwidth-bound, so the smaller 3-bit weights win); cutting active experts from 8 to 4 gives no speedup at batch=1.
|
| 105 |
+
**Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
|
| 106 |
+
for this arch). Not a quant change.
|
| 107 |
+
- **Multilingual** ability is reduced (the optional vocab-trim drops ~31% of tokens).
|
| 108 |
+
- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
|
| 109 |
+
tested) — the design-canon heal closes this.
|
| 110 |
+
- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
|
| 111 |
+
on this MoE — **MTP self-speculative is the right path**; the external draft is not recommended.
|
| 112 |
|
| 113 |
## Attribution & license
|
| 114 |
**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
|
| 115 |
+
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 47-tool agent
|
| 116 |
tooling is this repo's contribution.
|