Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
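The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server listens on port 8080 (the port used in the Pi and Hermes configuration above); the helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is reachable on localhost:8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once the server is running:
#   print(chat("Hello"))
```

Building the request separately from sending it lets you check the payload without loading the model.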

philipjohnbasile committed 13 days ago

Commit 6e9665d (verified) · 1 parent: d72ec2e

Upload README.md with huggingface_hub

Files changed (1)

README.md +66 -192

@@ -1,83 +1,31 @@
 ---
 license: mit
 base_model: zai-org/GLM-5.2
-base_model_relation: quantized
 library_name: mlx
 pipeline_tag: text-generation
 language: [en]
-tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent, conversational, soul, design, security, multi-domain]
-datasets:
-  - open-r1/Mixture-of-Thoughts
-  - open-r1/OpenR1-Math-220k
-  - open-thoughts/OpenThoughts-114k
-  - HuggingFaceH4/ultrachat_200k
-  - theblackcat102/evol-codealpaca-v1
-  - Salesforce/xlam-function-calling-60k
-  - glaiveai/glaive-function-calling-v2
-  - SWE-bench/SWE-smith-trajectories
-  - internlm/Lean-Workbook
 ---
-# GLM-5.2-Demolition — a 743B frontier MoE on a 128 GB Mac, with a masters-trained soul
-![One on-device base, a masters-trained soul, and swappable code specialties — on Apple Silicon](ai-engineer.png)
 **One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
-demolished it to **99 GiB (~106 GB)** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)** — then
-healed it toward **the actual masters of every field** and wrapped it in a **51-tool local agent** that does
-things a cloud model structurally cannot: the **compiler steers every line it writes**, it
-**re-verifies tests on `done`**, and it **blocks known-format secret writes**.
-It is no longer one niche model. It is **one 99 GB base + a deepening, masters-trained "soul" + a growing
-library of swappable code specialties** — a small **model factory** that runs on a single Mac.
-## The shape: one base, a soul, swappable code
-```
-CORE SOUL  (always-on, baked toward the masters)        SWAPPABLE CODE MODULE  (the dial)
-├─ design · dataviz · prose · math                      ├─ fullstack / AI-eng / DS-ML   (default)
-├─ research · architecture                       ×      ├─ game / app dev  (Unreal·Unity·Godot·Flutter)
-├─ SECURITY — purple-team: crypto · web · net ·         └─ legacy / enterprise  (COBOL·Java·PHP — old AND modern)
-│   secure-code · blue-team + red-team/pentest/CTF
-├─ science (physics · chem · bio)
-└─ perfumery
-```
-- The **base** (99 GB) is built once and never changes — it's the expensive part (pruning + quantizing 743B).
-- The **soul** is a small (~500 MB) LoRA that makes the model *elite* — not just correct — at every facet,
-  trained on gold spidered from the people who *defined* each field (Rams/Müller-Brockmann for design,
-  Kernighan/Knuth for code, Erdős/Pólya for math, Tufte for dataviz, Saltzer-Schroeder for security,
-  Strunk/Orwell for prose, Parnas/Uncle-Bob for architecture, Feynman/Popper for research).
-- The **code module** is a swappable ~500 MB adapter: a game dev and an AI engineer load the *same* elite
-  design/prose/math/security — only the coding expertise changes. New market = one small adapter, not a new base.
-## The soul, and how it's built
-The demolished base reverts to the *average* of its training. To make it **elite**, we don't ask it to
-imitate itself (that degenerates) — we **research the masters** and heal toward them:
-> **spider the elite canon of a field → generate audit-gated, secure-by-default gold → heal a LoRA → scorecard**
-The current core soul (`adapters-soul2`) is **250 masters-grounded examples across 8 facets** — every one
-`json.dumps`-clean, gated by a per-facet eliteness audit (with a degeneration guard), and **secure-by-default**
-(parameterized queries, AEAD crypto, no hardcoded secrets, validated input). The heal **preserved code**
-(HumanEval held at 116/164 = 70.7%, identical to the prior soul) while adding the full facet breadth.
-**Design ranges from restraint to maximalism** — Swiss minimalism (Rams · Müller-Brockmann · Vignelli) *and*
-pop-street (Warhol · Banksy · Mr-Brainwash · Murakami), plus Bauhaus, editorial, product-systems, and
-experimental/brutalist movements. **Security is full purple-team** — defensive core (crypto/web/net/secure-coding/
-blue-team) **and** authorized red-team/pentest/CTF (every offensive technique paired with its detection +
-hardening). **Math** spans Furstenberg → Ramsey → Zagier with Lean-4 proofs. Everything uses **current versions**
-(React 19 · PyTorch 2.x · OWASP 2025 · CVE-2025 · Java 21 · PHP 8.4) — *except* the legacy module, which is
-intentionally old (and also carries the modern target: COBOL-on-Kubernetes, Spring Boot 3, .NET 8).
 ## How it was made
 1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight ×
    activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set — it never fits in RAM).
 2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
-3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed, **`--max-seq-length 2048`** — above that
-   GLM-5.2's DSA sparse-attention scatter is non-differentiable and the backward crashes). A **code-first
-   balanced calibration** keeps the *math* super-experts alive through the prune; the **soul** heal then makes
-   it elite across all facets. *(GRPO/RLVR was tried and regressed → SFT.)*
 ## What makes it different (built + selftested)
 - **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
@@ -85,158 +33,84 @@ intentionally old (and also carries the modern target: COBOL-on-Kubernetes, Spri
   Practical *only* on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
 - **The verifier mesh:** every output meets its real tool — compile+run+**idiomatic lint** (clippy/ruff/
   gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
-- **A 51-tool agent** with **five defense layers** the frontier lacks out of the box: trust (checkpoint/rollback,
-  secret-scan, prompt-injection guard, audit, risk-gate), reliability (constraint-pinning, false-success guard,
-  flaky-test re-run), self-improvement (skill library, clarify-before-assuming), integrity (test-tamper guard,
-  fabrication-proof `done`, slopsquat guard), plus a **humanizer** (kills AI-slop, matches your voice).
-- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes on *your* private codebase — a cloud flagship can't.
 - **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
   live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
-## Features — everything that's built
-The bet isn't "highest SWE-bench" — it's **the most reliable** local agentic coder, elite across the whole stack.
-Every item below is **built + selftested** (not roadmap; the roadmap is its own section). Receipts live in the linked docs.
-### The demolition
-- **743B → 99 GB.** `zai-org/GLM-5.2` (743B MoE, ~381 GB at 4-bit / ~1.5 TB bf16) demolished to **99 GiB at q3a4**
-  (experts **3-bit**, attention/embeddings/lm_head **4-bit**) — runs **fully on one 128 GB M5 Max**.
-- **REAP prune 256 → 77 experts** by router-weighted saliency (`router_weight × activation_norm`, padding-masked),
-  streamed layer-by-layer (~5 GB working set — it never fits in RAM).
-- **NVFP4 re-quant wired** (`24b_stream_requantize --nvfp4`, `04b --bit-choices`) — half the 3-bit error and the
-  M5 2× hardware path; the **#59** saliency-dynamic quant prep is in place behind the factory.
-### The agentic-reliability moat
-- **51-tool ReAct agent** with trajectory compaction + stall detection for long-horizon runs.
-- **Grammar-constrained tool-JSON** — invalid tokens get zero probability at each step, so a malformed tool-call is
-  **structurally impossible** (vs the field's best: "fewer malformed"). Speaks 2026 strict-schema + MCP conventions.
-- **Verified / compiler-steered decoding** — the real type-checker runs in the loop and a line that adds an error is
-  backtracked **as it's written** (TS 0.3 ms · Python ~0 ms · Rust 34 ms per check).
-- **Fabrication-proof `done`** — the agent **re-runs the original tests** before claiming success; it can't hallucinate a pass.
-- **Integrity layer** — test-tamper guard, **16-provider secret-scan**, scope enforcement, slopsquat guard.
-- **The verifier mesh** — every output meets its real tool: compile+run+idiomatic-lint (clippy/ruff/gofmt/prettier) for
-  **5 langs**, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), and a **design render-critic** (render+see).
-### Multi-tier M5 Max hardware use (every tier earns its keep, in parallel)
-- **GPU** — decode + NVFP4 + image-gen.
-- **CPU (18 cores)** — runs the whole verify-everything stack in parallel: `verify_many` fans the verifier mesh across
-  all 18 cores (**measured 6.6×**) and feeds proof-search.
-- **ANE (16-core Neural Engine)** — embeddings via Apple `NLContextualEmbedding` (`src/ane_embed.py`, `backend=ane`,
-  no coremltools): **768-dim, ~9.5 ms**, verified.
-- **SSD** — warm-start **prompt-cache persistence** (`save()/load()` + keyed warm-start; round-trip selftest PASS).
-### Breadth — the 10-facet soul
-- **One always-on soul** makes the model *elite*, not just correct, across **design · dataviz · prose · math ·
-  research · architecture · security (purple-team) · science · perfumery** — trained on master-grounded gold,
-  per-facet eliteness-audited, secure-by-default, with code preserved (HumanEval held at **116/164 = 70.7%**).
-- **Formal-math Lean prover** (`66_prove`) — local Lean-4 prover lane: **miniF2F-test 32/226 = 14.2% pass@4**,
-  **Lean-verified**, contamination-checked.
-### Multimodal stack (all MLX)
-- **Vision** (Qwen3-VL-4B-8bit), **image-gen**, **video**, and **structured tools** — plus code_intel across 5 langs.
-### The model factory
-- **Swappable domain adapters** on one base (download once): each capability is a ~500 MB LoRA. **Pattern A =
-  base + module gold** (a game dev and an AI engineer share the *same* elite soul; only the code module swaps).
-- **Shipped souls:** **soul2 ✓** and **soul-v3 ✓** (on HF); `heal_queue.sh` driver is autonomous.
-- **In the heal queue:** `fullstack` (healing now) → `gamedev` → `legacy` → FACTORY_DONE.
 ## Requirements
 - **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
-- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patch** (`glm_moe_dsa.py`
-  + `install_glm_dsa_patch.py`) — current stock mlx_lm can't load it. **Native support is landing upstream**
-  ([ml-explore/mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)); once it merges, recent mlx_lm
-  loads with **no patch**.
-- **⚠️ Raise the GPU memory ceiling — required.** The model needs ~101.6 GB; macOS caps the GPU working set at
-  ~110 GB by default, so it OOM-crashes on long generations. Fix before serving:
   ```bash
   sudo sysctl iogpu.wired_limit_mb=122000        # 122 GB; one-shot (resets on reboot)
   sudo bash dist/install_gpu_limit.sh            # OR: persist it via a LaunchDaemon
   ```
 ## Use it
 ```bash
 python dist/install_glm_dsa_patch.py          # patch mlx_lm (venv AND LM Studio's bundled engine)
-# serve the base + the soul (the swappable adapter is how you pick the specialty):
 GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
-    --adapter-path adapters-soul2             # the masters-trained core soul
-# query it — enable_thinking toggles the reasoning trace (off = faster, on = harder problems):
-curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-  -d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'
-# drive the 51-tool agent on your repo:
 python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
 ```
-In **LM Studio**: run the patch, fully quit + reopen, then load the model. **Speed:** single-stream is
-memory-capped at ~11–14 tok/s — ALL speculative methods measured-DEAD on this MoE (see `SPEED.md`); throughput = batching.
-## Performance (M5 Max 128 GB)
 | Metric | Value |
 |---|---|
-| Size | 99 GiB / ~106 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
-| HumanEval pass@1 | **116/164 (70.7%)** — full benchmark, single-shot, hidden-test scored; **held across the soul-v2 heal** |
-| Math GSM8K | **8/12 (66%)** — small held-out subset; note: the verbose-CoT model needs a tighter answer-parser for the full set |
-| miniF2F-test (formal proof) | **32/226 (14.2%)** — pass@4, Lean-verified, contamination-checked; a general model, NOT a specialized prover |
 | Algebra (SymPy-checked) | **3/4 (75%)** |
-| Decode speed (single-stream) | **~11–14 tok/s** — the memory floor; speculative measured-dead ([SPEED.md](SPEED.md)) |
-| Batched throughput | **2.6× at B=8** (15.8→41.1 tok/s) · 1.74× at B=6 on the live serve — concurrent requests batch natively |
-**Speed in one line:** single-stream is memory-capped at **~11–14 tok/s** — every speculative method (MTP / EAGLE /
-prompt-lookup / dsa-block-size) was *measured* and is dead on this memory-bound MoE. The real win is **batching: a
-measured 2.6× throughput**, which `mlx_lm.server` delivers **natively** on concurrent requests. Receipts: [`SPEED.md`](SPEED.md).
-**Benchmark honesty:** HumanEval is the **full 164** (116/164 = 70.7%, single-shot); GSM8K (**n=12**) is a **small
-held-out subset**; miniF2F **is** the full 226. Every number is **contamination-checked** (0% / 0% / 0.4% near-dup) —
-**reasoned, not memorized**. Honest frontier-vs-us comparison + projections: [`BENCHMARKS.md`](BENCHMARKS.md).
-## The factory — swappable souls & code, one base
-The spider→gold→heal recipe is **domain-agnostic**: "make a model elite at X" is now a repeatable procedure.
-On the one 99 GB base (downloaded once), each new capability is a ~500 MB adapter:
-- **Core soul** — design · dataviz · prose · math · research · architecture · security (purple-team) · science · perfumery.
-- **Code modules (swap one)** — `fullstack/AI-eng/DS-ML` (RAG, agents, MLOps, deep-learning, data-eng, web/devops) ·
-  `game/app` (Unreal C++/Blueprints, Unity C#, Godot GDScript, Flutter/Dart, Nystrom patterns, shaders, netcode) ·
-  `legacy` (COBOL/mainframe, enterprise Java, PHP — classic **and** modernized to Java 21 / PHP 8.4 / .NET 8 / COBOL-on-K8s).
-- Verified by design: each code module's gold targets a **compile-verification** pass (the leap Lean gave miniF2F).
-**Swap a module + build a new specialty:** full mechanics — runtime swap · the two soul-merge patterns ·
-the spider→gold→heal recipe · the rules — are in [`FACTORY.md`](FACTORY.md). The model is also being taught
-its *own* factory (route a task → the right module, emitting a `<module>…</module>` signal), so it can self-select the specialty.
-## Roadmap — what's queued next
-Honest queue (the live kanban is `BACKLOG.md`). These are **not built yet** — the Features section above is:
-- **ANE vision (#79)** — move the vision encoder onto the Neural Engine (the big ANE win; convert-friendly model).
-- **ANE speech (#87)** — Whisper / `SFSpeechRecognizer` on the ANE — a voice lane.
-- **SSD-backed long-context KV (#86)** — KV offload to the 14.5 GB/s SSD for long context (attacks our weakest axis
-  vs 1M-ctx rivals; the #85 prompt-cache plumbing is already done).
-- **Metal-4 TensorOps fused-MoE kernel (#81)** — custom fused kernel, the new M5 decode lever (**~30–60% decode**,
-  `research/mlx_speed_deepdive.md`); distinct from the (dead) speculative methods.
-- **#59 NVFP4 collapse-fix** — saliency-dynamic quant (early/late experts at 4-bit) to cure long-gen Computation Collapse;
-  tooling is wired, GPU-gated behind the factory.
-- **Agentic-gold heal (#84)** — heal the 23 staged agentic-gold examples into the soul.
-## Roadmap — the Demolition family (shrink, keep the soul)
-Same masters-trained soul, every Mac — the elite training lives in the size-agnostic calibration + heal corpus:
-```
- ~106GB : ████████  77 experts · 3-bit   (this model)   → 128 GB Mac
-   67GB : ██████    46 experts · 3-bit                   → 96 GB Mac
-   55GB : █████     36 experts · 3-bit                   → 64 GB Mac
-   36GB : ███       26 experts · 2.5-bit                 → 48 GB Mac
-   20GB : ██        16 experts · 2-bit  ⚗️                → 32 GB Mac
-   14GB : █          8 experts · 2-bit  ⚗️ (the floor)    → 24 GB Mac
-```
-Sizes **measured** from the build: **~10.4 GB fixed base** + experts × ~1.24 GB × bits/3. The base dominates,
-so **below ~13 GB is impossible** — the right column is your **minimum Mac RAM**.
 ## Honest limitations
-- **Specialist base:** ~70% of experts pruned — strong in the trained facets, weaker on long-tail trivia. Not the full 743B.
-- **Speed ~11–14 tok/s decode — the memory floor.** Every speculative lever was benchmarked and is dead here
-  (proven 4 ways — [`SPEED.md`](SPEED.md)): MTP **0%**, external/prompt-lookup draft **0.32×**, dsa-block-size **flat**.
-  The real "faster" is **throughput via batching (2.6× at B=8)**. A fresh EAGLE-3 head is the only single-stream path and is **not** recommended.
-- **Raw single-shot arithmetic** is the weak spot (the model reasons *very* verbosely on math) — its **structured/formal**
-  math (miniF2F via the Lean prover) is far stronger. The GSM8K subset needs a tighter answer-parser to measure cleanly.
-- **The soul is a LoRA, not magic** — evaluate the per-facet soul-retention scorecard before relying on a facet; the
-  swappable code modules (game/app, legacy) have their **gold built** and are **healing into adapters** (the factory's next output).
-- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens). Prompt-cache can OOM under heavy concurrent load.
 ## Attribution & license
 **MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
-to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / soul / 51-tool agent
 tooling is this repo's contribution.
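
The size rule quoted in the removed family roadmap (a fixed base of about 10.4 GB plus experts × ~1.24 GB × bits/3) can be checked against the 3-bit rows of that table. This is a minimal sketch; the function and constant names are illustrative, and the constants are the approximate figures from the text.

```python
# Approximate figures quoted in the README text; names are illustrative.
FIXED_BASE_GB = 10.4       # non-expert weights that every variant carries
GB_PER_EXPERT_3BIT = 1.24  # size contributed per retained expert at 3-bit


def estimate_size_gb(experts: int, bits: float) -> float:
    """Estimated model size in GB: fixed base + experts x 1.24 GB x bits/3."""
    return FIXED_BASE_GB + experts * GB_PER_EXPERT_3BIT * bits / 3


# Check the three 3-bit rows of the table.
for experts in (77, 46, 36):
    print(experts, "experts at 3-bit:", round(estimate_size_gb(experts, 3), 1), "GB")
```

For 77, 46, and 36 experts the estimates round to the 106, 67, and 55 GB listed in the table.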

 ---
 license: mit
 base_model: zai-org/GLM-5.2
 library_name: mlx
 pipeline_tag: text-generation
 language: [en]
+tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
 ---
+# GLM-5.2-Demolition — a 743B frontier MoE, demolished to run on a 128 GB Mac
 **One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
+demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)** — then
+healed it and wrapped it in a **47-tool local agent** that does things a cloud model structurally
+cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
+secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.
+A **niche specialist**, not a general model — tuned to beat a frontier model *in one lane* (agentic
+coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.
 ## How it was made
 1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight ×
    activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set — it never fits in RAM).
 2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
+3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
+   **code-first balanced calibration** (so the *math* super-experts survive the prune — v3's coding-only
+   calibration collapsed math) + heal/distill on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
+   Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed → SFT.)*
 ## What makes it different (built + selftested)
 - **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
   Practical *only* on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
 - **The verifier mesh:** every output meets its real tool — compile+run+**idiomatic lint** (clippy/ruff/
   gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
+- **A 47-tool agent** with **five defense layers** the frontier lacks out of the box:
+  **trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
+  **reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
+  **self-improvement** (skill library, large-output pointers, clarify-before-assuming),
+  **integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
+  plus a **humanizer** (kills AI-slop, matches your voice).
+- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
+  writes in your style — a cloud flagship can't be tuned on your private code.
 - **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
   live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
+## Every chip on the M5 Max, working
+The agent spreads perception, verification, and dispatch across **all six compute blocks** so the GPU stays
+free for token generation (built + selftested):
+- **GPU** (40-core + M5 Neural Accelerators) — the 99 GB model decodes + LoRA-heals.
+- **Neural Engine** (16-core) — embeddings · OCR · image segmentation / pose / object-detection · NER+POS ·
+  audio classification + VAD · neural TTS · zero-shot routing · rerank — all via Apple frameworks, no CoreML, no GPU.
+- **18 CPU cores** — the verifier mesh fanned out (`verify_many`, 6.6×) · 9-language compile-verify · tabular ML.
+- **Media Engine** — hardware H.264/HEVC/AV1 decode + encode for the video lane.
+- **AMX/SME** — matrix coprocessor via Accelerate (~2.1 TFLOP/s f32), implicit in every numpy op.
+- **ASR** = **Whisper on MLX** (no mic-permission needed). An **Any-to-Any omni-router** sends any input
+  (text / image / audio / video / table) to its optimal block.
+## The model factory (swappable domain souls)
+One 99 GB base + hot-swappable LoRA "souls" (~100 MB each) — change the model's specialty by swapping the
+adapter: **code · design · agentic · gamedev · legacy/enterprise · security · fullstack · science · data ·
+perfumery**. Each is healed from the same base by an autonomous chain that forges the whole library overnight
+on the one Mac — and a `factory`-dispatcher soul makes the model route requests to the right specialty itself.
 ## Requirements
 - **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
+- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patches** — stock
+  mlx_lm can't load it.
+- **⚠️ Raise the GPU memory ceiling — required.** The model needs ~101.6 GB; macOS caps the GPU
+  working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long
+  generations. Fix before serving:
   ```bash
   sudo sysctl iogpu.wired_limit_mb=122000        # 122 GB; one-shot (resets on reboot)
   sudo bash dist/install_gpu_limit.sh            # OR: persist it via a LaunchDaemon
   ```
+  Without this the model appears to "randomly crash" — it's just memory-starved.
 ## Use it
 ```bash
 python dist/install_glm_dsa_patch.py          # patch mlx_lm (venv AND LM Studio's bundled engine)
 GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
+    --adapter-path heal/adapters-v4           # serve (OpenAI-compatible); v2 + heal/adapters also ship
+# drive the 47-tool agent on your repo:
 python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
+# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
 ```
+In **LM Studio**: run the patch, fully quit + reopen, then load the model.
+## Performance (M5 Max 128 GB, v4)
 | Metric | Value |
 |---|---|
+| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
+| HumanEval pass@1 | **19/20 (95%)**, single-shot |
+| Math GSM8K | **8/12 (66%)** — recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
 | Algebra (SymPy-checked) | **3/4 (75%)** |
+| Decode speed | **11.3 tok/s** (no draft) — see the speed note in limitations |
+| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
 ## Honest limitations
+- **Specialist:** ~70% of experts pruned — strong in the target niche, weaker outside it. Not the full 743B.
+- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
+  **DSA attention kernels** (mlx #837 / #3402 — *improves for free* as MLX matures), partly the bandwidth
+  cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
+  for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1.
+  **Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
+  for this arch). Not a quant change.
+- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
+- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
+  tested) — the design-canon heal closes this.
+- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
+  on this MoE — **MTP self-speculative is the right path**; the external draft is not recommended.
 ## Attribution & license
 **MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
+to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 47-tool agent
 tooling is this repo's contribution.
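
The pruning criterion named in "How it was made" (`router_weight × activation_norm`, padding-masked) can be illustrated on toy data. This is a sketch under stated assumptions, not the repo's implementation: it assumes the score is averaged over the non-padding tokens routed to each expert, that `activation_norm` is the L2 norm of the expert's output, and the function names are hypothetical.

```python
from math import sqrt


def reap_saliency(router_weights, expert_outputs, padding_mask, num_experts):
    """Per-expert mean of router_weight * ||expert_output|| over routed, real tokens.

    router_weights[t][e]: gate weight of token t for expert e (0.0 if not routed).
    expert_outputs[t][e]: output vector of expert e for token t.
    padding_mask[t]:      True for real tokens, False for padding.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for t, is_real in enumerate(padding_mask):
        if not is_real:
            continue  # padding tokens never contribute to the score
        for e in range(num_experts):
            weight = router_weights[t][e]
            if weight == 0.0:
                continue  # token t was not routed to expert e
            norm = sqrt(sum(x * x for x in expert_outputs[t][e]))
            totals[e] += weight * norm
            counts[e] += 1
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]


def experts_to_keep(saliency, keep):
    """Indices of the `keep` highest-saliency experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy layer: 3 tokens (the last one is padding), 3 experts, 2-dim outputs.
weights = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8], [1.0, 0.0, 0.0]]
outputs = [
    [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],
    [[100.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
mask = [True, True, False]
scores = reap_saliency(weights, outputs, mask, num_experts=3)
print(scores, experts_to_keep(scores, keep=2))
```

In the real build the same ranking would be applied per layer with `keep=77` out of 256 experts. The masked third token shows why padding must be excluded: its large activation would otherwise inflate expert 0's score.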