---
license: mit
base_model: zai-org/GLM-5.2
library_name: mlx
pipeline_tag: text-generation
language: [en]
tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
---
# GLM-5.2-Demolition: a 743B frontier MoE, demolished to run on a 128 GB Mac
![My AI-Engineer build: one model, fully local, verify everything. AI-Engineer / Full-Stack / Data-Science / ML, on Apple Silicon](ai-engineer.png)
**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)**, then
healed it and wrapped it in a **51-tool local agent** that does things a cloud model structurally
cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.

A **niche specialist**, not a general model: tuned to beat a frontier model *in one lane* (agentic
coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.
## My AI-Engineer / Full-Stack / Data-Science / ML build
This is the version I run wearing all four hats: **one on-device model, no cloud key**, tooled for the
whole stack of those roles (strongest in the coding/agentic lane, deliberately so):
- **AI Engineer:** *builds and ships agentic AI locally* with the 51-tool ReAct agent, **verified +
constrained decoding**, grammar-constrained tool I/O, MLX-native serving + the speed/stability work
(prompt-cache, continuous batching, frontier-grade serving). The model that *makes* AI products.
- **Full-Stack:** front-to-back in **TS/JS/Python/Rust/Go/HTML/CSS + Postgres**, the **compiler steering
every line**, a **design soul** (render-and-see critic: WCAG / type-scale / OKLCH) for the UI, and
**SQL-on-a-real-schema** for the API, plus edit→test→fix agentic loops on *your* repo.
- **Data Science:** stateful **REPL**, **SymPy / pandas / numpy / sklearn**, arXiv-RAG, competition-grade
math (GSM8K-style), and **code-rendered figures** (matplotlib / manim / TikZ).
- **Machine Learning:** it *is* applied ML end-to-end, with **REAP expert-pruning** (256→77), **mixed-precision
quantization**, **LoRA healing**, **distillation**, **MTP self-speculation**, and GRPO/RLVR experiments. The
build itself is a working reference.
…**and the hats that fall straight out of "verify-everything":**
- **Security / DevSecOps:** secret-scanning (16 providers: AWS/GitHub/OpenAI/**Anthropic/HuggingFace**/Slack/Stripe/Google/DB-URLs/JWT/PEM…),
prompt-injection guard, test-tamper + **fabrication-proof `done`**, slopsquat/typosquat guard, risk-gated
tools. It structurally **can't leak a key or fake a green test**.
- **Formal-Methods / Verification Engineer:** a local **Lean-4** prover (premise selection, expert-iteration,
self-correction from the *real* Lean error) → **correct-by-construction** math, not vibes.
- **MLOps / Inference:** the serving spine (prompt-cache, continuous batching, watchdog + circuit-breaker +
memory-ceiling) for **frontier-grade stability** over hours-long local runs on one box.
- **Multimodal / CV:** reads images + video (VLM), palette-steered **image-gen**, code-rendered
video/figures (**manim/TikZ**), all MLX.
- **Design Engineer:** a render-and-***see*** critic enforcing **WCAG** contrast, modular type scale,
8 px grid, **OKLCH** harmony (not just "looks fine").

One model, fully local, **verify-everything**: every hat above, on a MacBook.
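
To make the secret-scanning hat concrete, here is a minimal sketch of a pre-emit scan. It is not the agent's actual scanner (which this README says covers 16 providers); the three patterns are widely published key formats, and `SECRET_PATTERNS`, `scan_secrets`, and `safe_to_emit` are hypothetical names used only for illustration.

```python
import re

# Illustrative patterns only. Assumption: the real scanner is broader and is not
# reproduced here; these are three well-known public key formats.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_secrets(text: str) -> list[tuple[str, int]]:
    """Return (pattern_name, line_number) for every suspected secret in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits


def safe_to_emit(text: str) -> bool:
    """Gate: only let output through when the scan finds nothing."""
    return not scan_secrets(text)
```

A gate like `safe_to_emit` would sit between generation and any tool that writes files or sends output, so a match blocks the write instead of relying on the model to notice.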
## How it was made
1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight × activation_norm`,
padding-masked), streaming layer-by-layer (~5 GB working set, because the full model never fits in RAM).
2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
**code-first balanced calibration** (so the *math* super-experts survive the prune; v3's coding-only
calibration collapsed math) + heal/distill on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed, so the heal uses SFT.)*
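
The pruning score in step 1 can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the repo's streaming implementation: the per-token averaging, the list-of-lists inputs, and the names `reap_saliency` and `experts_to_keep` are all hypothetical.

```python
def reap_saliency(router_weights, activation_norms, padding_mask):
    """Per-expert saliency: mean of router_weight * activation_norm over real tokens.

    router_weights[t][e]   : gate weight for expert e on token t (0 if not routed)
    activation_norms[t][e] : norm of expert e's output on token t
    padding_mask[t]        : True for real tokens, False for padding
    Averaging over real tokens is an assumption of this sketch.
    """
    num_experts = len(router_weights[0])
    totals = [0.0] * num_experts
    real_tokens = 0
    for weights, norms, is_real in zip(router_weights, activation_norms, padding_mask):
        if not is_real:
            continue  # padding must not vote for any expert
        real_tokens += 1
        for e in range(num_experts):
            totals[e] += weights[e] * norms[e]
    if real_tokens == 0:
        raise ValueError("calibration batch contains only padding")
    return [total / real_tokens for total in totals]


def experts_to_keep(saliency, keep):
    """Indices of the `keep` most salient experts, in ascending index order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration batch: 4 experts, 3 tokens, the last token is padding.
weights = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0], [0.0, 0.0, 0.0, 1.0]]
norms = [[2.0, 1.0, 5.0, 1.0], [1.0, 3.0, 1.0, 1.0], [1.0, 1.0, 1.0, 9.0]]
mask = [True, True, False]
print(experts_to_keep(reap_saliency(weights, norms, mask), keep=2))  # → [0, 1]
```

Expert 3 only fires on the padding token, so masking drops its score to zero. That is why the calibration mix matters: an expert that never fires on the calibration data looks worthless and gets pruned.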
## What makes it different (built + selftested)
- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
the loop**; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check.
Practical *only* on Apple Silicon: unified memory lets the model (GPU) and compiler (CPU) share RAM.
- **The verifier mesh:** every output meets its real tool: compile+run+**idiomatic lint** (clippy/ruff/
gofmt/prettier) for 5 languages, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
- **A 51-tool agent** with **five defense layers** the frontier lacks out of the box:
**trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
**reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
**self-improvement** (skill library, large-output pointers, clarify-before-assuming),
**integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
plus a **humanizer** (kills AI-slop, matches your voice).
- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
writes in your style; a cloud flagship can't be tuned on your private code.
- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
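
The accept-or-backtrack loop behind verified decoding can be sketched as follows. This is a toy, not the repo's decoder: the "model" is a fixed list of candidate lines per step, and the stand-in checker `count_errors` only asks whether the text parses as Python, where the real system runs a type-checker. `count_errors` and `verified_decode` are hypothetical names.

```python
import ast


def count_errors(source: str) -> int:
    """Stand-in checker: 0 if the source parses as Python, else 1."""
    try:
        ast.parse(source)
        return 0
    except SyntaxError:
        return 1


def verified_decode(candidates_per_step, check=count_errors):
    """Accept one line per step; reject any line that raises the error count."""
    accepted = []
    errors = check("")
    for candidates in candidates_per_step:
        for line in candidates:
            trial_errors = check("\n".join(accepted + [line]))
            if trial_errors <= errors:
                accepted.append(line)
                errors = trial_errors
                break  # keep this line, move to the next step
        else:
            # No candidate survived the checker at this step.
            raise RuntimeError("every candidate line added an error")
    return "\n".join(accepted)


# The first candidate at step 2 is broken, so the loop falls back to the second.
steps = [["x = 1"], ["y = x +", "y = x + 1"], ["print(y)"]]
print(verified_decode(steps))
```

The toy checker only copes with complete statements; a real checker also has to tolerate partially written blocks, which is where most of the engineering effort would go.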
## Requirements
- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patch** (`glm_moe_dsa.py`
+ `install_glm_dsa_patch.py`) because current stock mlx_lm can't load it. **Native support is landing upstream**
([ml-explore/mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)); once it merges, recent mlx_lm
loads this model with **no patch**, and the bundled patch remains the interim loader for older versions.
- **⚠️ Raise the GPU memory ceiling (required).** The model needs ~101.6 GB; macOS caps the GPU
working set at ~110 GB by default, which leaves only ~8 GB of headroom, so it OOM-crashes
(Metal command-buffer timeout) on long generations. Fix before serving:
```bash
sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
```
Without this the model appears to "randomly crash"; it's just memory-starved.
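
The arithmetic behind that setting, as a small sketch. It uses decimal units (1 GB = 1000 MB), which is what the `# 122 GB` comment implies. The helper name `gpu_headroom_gb` is hypothetical, and reading the headroom as room for the KV cache and activations is this sketch's assumption, not something measured here.

```python
def gpu_headroom_gb(wired_limit_mb: int, model_gb: float) -> float:
    """GB left under the GPU wired limit once the model itself is resident."""
    return wired_limit_mb / 1000 - model_gb


MODEL_GB = 101.6  # figure quoted above

default_headroom = gpu_headroom_gb(110_000, MODEL_GB)  # ~110 GB default cap
raised_headroom = gpu_headroom_gb(122_000, MODEL_GB)   # after the sysctl above

print(f"default: {default_headroom:.1f} GB, raised: {raised_headroom:.1f} GB")
```

Raising the limit roughly moves the headroom from about 8 GB to about 20 GB, which is the margin long generations need.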
## Use it
```bash
python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
# query it (`enable_thinking` toggles the reasoning trace; GLM-specific; off = faster, on = harder problems):
curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'
# drive the 51-tool agent on your repo:
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
```
In **LM Studio**: run the patch, fully quit + reopen, then load the model.
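
The `curl` call above translates directly to Python's standard library. This is a minimal sketch that assumes the server is on `localhost:8080` as in the example and replies in the usual OpenAI chat-completions shape (`choices[0].message.content`); `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=True, base_url="http://localhost:8080"):
    """Build the same POST the curl example sends."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, enable_thinking=True, timeout=600):
    """Send the request and return the assistant's text (assumed response shape)."""
    request = build_chat_request(prompt, enable_thinking)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Write a typed debounce in TypeScript."))
```

The generous timeout is deliberate: at roughly 11 tok/s, a long thinking-on answer can take minutes.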
**Design: elite, not just competent.** Full guide + copy-paste system prompt: [`design/DESIGN.md`](design/DESIGN.md), with 9 movement-grounded gold seeds.
The base prior reverts to the *average* of its training (hex + arbitrary spacing), so steer + gate it.
Prepend `src/design_canon.py`'s `CANON` (oklch-only · 8px grid · 1.25 type scale ·
WCAG · **bespoke: no Bootstrap/Tailwind/framework cookie-cutter**) as the system prompt for elite output
*today*; `audit_design()` gates eliteness (OKLCH/grid/scale + rejects framework boilerplate) and the
constrained decoder bans non-OKLCH tokens; `scripts/76_design_flywheel.py` (generate→audit→keep-only-elite→SFT)
heals the **native** prior so it designs elite with no prompt at all.
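
For the WCAG part of that gate, the contrast check is standard arithmetic and can be sketched directly. This is not `audit_design()` itself, whose internals are not shown here; it only computes the WCAG 2.x contrast ratio from 8-bit sRGB triples (converting from OKLCH is out of scope), and the function names are hypothetical.

```python
def _linearize(channel_8bit: int) -> float:
    """sRGB channel (0-255) to linear light, per the WCAG relative-luminance formula."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


def passes_wcag_aa(fg, bg, large_text=False) -> bool:
    """AA needs 4.5:1 for body text and 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # → 21.0
```

A measured gate like this is what separates a render-and-measure critic from a "looks fine" judgment: mid-gray `#777777` on white lands just under 4.5:1 and fails AA for body text.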
## Performance (M5 Max 128 GB, v4)
| Metric | Value |
|---|---|
| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
| HumanEval pass@1 | **19/20 (95%)**, single-shot |
| Math GSM8K | **8/12 (67%)**, recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
| Algebra (SymPy-checked) | **3/4 (75%)** |
| Decode speed | **11.3 tok/s** (no draft); see the speed note in limitations |
| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
**Benchmark honesty:** every number is **contamination-checked**: HumanEval, GSM8K, and miniF2F test problems are
*not* in the training data (0% / 0% / 0.4% near-dup), so they're **reasoned, not memorized**. Method + full
training-data provenance/licenses: [`TRAINING_DATA.md`](TRAINING_DATA.md).
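
For readers who want to reproduce a near-dup rate on their own data, one common approach is word n-gram overlap. The sketch below is not the method documented in `TRAINING_DATA.md`; the Jaccard measure, the n-gram size, the threshold, and all function names are assumptions chosen for illustration.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams; short texts collapse to a single tuple."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_dup_rate(test_problems, training_docs, n=8, threshold=0.5):
    """Fraction of test problems whose n-gram set overlaps some training doc."""
    train_sets = [word_ngrams(doc, n) for doc in training_docs]
    flagged = 0
    for problem in test_problems:
        grams = word_ngrams(problem, n)
        if any(jaccard(grams, t) >= threshold for t in train_sets):
            flagged += 1
    return flagged / len(test_problems) if test_problems else 0.0


train = ["the quick brown fox jumps over the lazy dog"]
tests = [
    "the quick brown fox jumps over the lazy dog",  # verbatim copy
    "compute the sum of the first ten primes",       # unrelated
]
print(near_dup_rate(tests, train, n=3))  # → 0.5
```

On real corpora the pairwise loop is too slow, and an index such as MinHash is the usual replacement; the scoring idea stays the same.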
## Which version for your runtime (June 2026: MLX is now everywhere on Apple Silicon)
| Runtime | MLX *(this repo)* | GGUF *(with the family)* |
|---|---|---|
| `mlx_lm` (CLI / server) | ✅ native | - |
| **LM Studio** | ✅ Mac (dual-backend) | ✅ Win/Linux |
| **Ollama 0.19+** | ✅ Mac (MLX engine, since Mar 2026) | ✅ 0.30 (llama.cpp) |
| **macMLX** | ✅ native (SwiftUI + OpenAI API) | - |
| `llama.cpp` | - | ✅ |
| mlx-swift apps | ✅ when `glm_moe_dsa` lands in mlx-swift-lm | - |
**MLX is the native Apple-Silicon path**: mlx_lm · LM Studio (Mac) · **Ollama 0.19+** · macMLX all run it
(MLX beats llama.cpp ~30-40% on M5). **GGUF** (shipped with the family) covers llama.cpp + Windows/Linux.
Every MLX runtime gets this model the moment `glm_moe_dsa` lands upstream
([mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)), or **today** via `install_glm_dsa_patch.py`,
which scans *every* mlx_lm install (LM Studio's, Ollama's, your venv's).
## Roadmap: the Demolition family (shrink, keep the soul)
Same masters-trained soul (design · dataviz · code · security · math · prose · architecture · research), every
Mac. The elite training lives in the facet-inclusive calibration + heal corpus, which are **size-agnostic**:
```
99GB : ████████ (baseline, this model)
64GB : should hold ~baseline (96 GB Macs)
48GB : should hold high (64 GB Macs)
28GB : the squeeze: watch which facets dip (36-48 GB Macs)
14GB : ⚗️ where does the soul start to break? (24 GB Macs)
7GB : ⚗️ the floor (16 GB laptops)
```
Each size: facet-calib → prune harder → quantize → heal (the soul corpus) → soul-retention scorecard (% elite
per facet). See [`design/DESIGN.md`](design/DESIGN.md).
## Honest limitations
- **Specialist:** ~70% of experts pruned, so it is strong in the target niche and weaker outside it. Not the full 743B.
- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
**DSA attention kernels** (mlx #837 / #3402; *improves for free* as MLX matures), partly the bandwidth
cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1.
**Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
for this arch). Not a quant change.
- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
tested); the design-canon heal closes this.
- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
on this MoE, so **MTP self-speculative is the right path**; the external draft is not recommended.
## Attribution & license
**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed), so this derivative is MIT too: free
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 51-tool agent
tooling is this repo's contribution.