---
license: mit
base_model: zai-org/GLM-5.2
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
language: [en]
tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent, conversational, soul, design, security, multi-domain]
datasets:
- open-r1/Mixture-of-Thoughts
- open-r1/OpenR1-Math-220k
- open-thoughts/OpenThoughts-114k
- HuggingFaceH4/ultrachat_200k
- theblackcat102/evol-codealpaca-v1
- Salesforce/xlam-function-calling-60k
- glaiveai/glaive-function-calling-v2
- SWE-bench/SWE-smith-trajectories
- internlm/Lean-Workbook
---
# GLM-5.2-Demolition: a 743B frontier MoE on a 128 GB Mac, with a masters-trained soul



**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
demolished it to **99 GiB (~106 GB)** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)**, then
healed it toward **the actual masters of every field** and wrapped it in a **51-tool local agent** that does
things a cloud model structurally cannot: the **compiler steers every line it writes**, it
**re-verifies tests on `done`**, and it **blocks known-format secret writes**.

It is no longer one niche model. It is **one 99 GB base + a deepening, masters-trained "soul" + a growing
library of swappable code specialties**: a small **model factory** that runs on a single Mac.
## The shape: one base, a soul, swappable code

```
CORE SOUL (always-on, baked toward the masters)        SWAPPABLE CODE MODULE (the dial)
├─ design · dataviz · prose · math                     ├─ fullstack / AI-eng / DS-ML (default)
├─ research · architecture                         ×   ├─ game / app dev (Unreal·Unity·Godot·Flutter)
├─ SECURITY, purple-team: crypto · web · net ·         └─ legacy / enterprise (COBOL·Java·PHP, old AND modern)
│    secure-code · blue-team + red-team/pentest/CTF
├─ science (physics · chem · bio)
└─ perfumery
```
- The **base** (99 GB) is built once and never changes; it's the expensive part (pruning + quantizing 743B).
- The **soul** is a small (~500 MB) LoRA that makes the model *elite*, not just correct, at every facet,
  trained on gold spidered from the people who *defined* each field (Rams/Müller-Brockmann for design,
  Kernighan/Knuth for code, Erdős/Pólya for math, Tufte for dataviz, Saltzer-Schroeder for security,
  Strunk/Orwell for prose, Parnas/Uncle-Bob for architecture, Feynman/Popper for research).
- The **code module** is a swappable ~500 MB adapter: a game dev and an AI engineer load the *same* elite
  design/prose/math/security; only the coding expertise changes. New market = one small adapter, not a new base.
## The soul, and how it's built

The demolished base reverts to the *average* of its training. To make it **elite**, we don't ask it to
imitate itself (that degenerates); we **research the masters** and heal toward them:

> **spider the elite canon of a field → generate audit-gated, secure-by-default gold → heal a LoRA → scorecard**

The current core soul (`adapters-soul2`) is **250 masters-grounded examples across 8 facets**, every one
`json.dumps`-clean, gated by a per-facet eliteness audit (with a degeneration guard), and **secure-by-default**
(parameterized queries, AEAD crypto, no hardcoded secrets, validated input). The heal **preserved code**
(HumanEval held at 116/164 = 70.7%, identical to the prior soul) while adding the full facet breadth.

**Design ranges from restraint to maximalism**: Swiss minimalism (Rams · Müller-Brockmann · Vignelli) *and*
pop-street (Warhol · Banksy · Mr-Brainwash · Murakami), plus Bauhaus, editorial, product-systems, and
experimental/brutalist movements. **Security is full purple-team**: defensive core
(crypto/web/net/secure-coding/blue-team) **and** authorized red-team/pentest/CTF (every offensive technique
paired with its detection + hardening). **Math** spans Furstenberg → Ramsey → Zagier with Lean-4 proofs.
Everything uses **current versions** (React 19 · PyTorch 2.x · OWASP 2025 · CVE-2025 · Java 21 · PHP 8.4),
*except* the legacy module, which is intentionally old (and also carries the modern target:
COBOL-on-Kubernetes, Spring Boot 3, .NET 8).
## How it was made

1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency** (**REAP** = `router_weight ×
   activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set, because the full model never fits in RAM).
2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed, **`--max-seq-length 2048`**; above that
   GLM-5.2's DSA sparse-attention scatter is non-differentiable and the backward pass crashes). A **code-first
   balanced calibration** keeps the *math* super-experts alive through the prune; the **soul** heal then makes
   it elite across all facets. *(GRPO/RLVR was tried and regressed, so the heal uses SFT.)*
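
The saliency rule in step 1 can be sketched in a few lines. This is a minimal illustration, not the repo's streaming implementation: the function names, the per-token list layout, and the choice to average over non-padding tokens are assumptions. Only the `router_weight × activation_norm` product and the padding mask come from the description above.

```python
from typing import List, Sequence


def expert_saliency(
    router_weights: Sequence[Sequence[float]],    # [token][expert] gate weight, 0 if not routed
    activation_norms: Sequence[Sequence[float]],  # [token][expert] norm of the expert output
    is_padding: Sequence[bool],                   # [token] True for padding tokens
) -> List[float]:
    """Mean of router_weight * activation_norm per expert over real tokens."""
    num_experts = len(router_weights[0])
    totals = [0.0] * num_experts
    real_tokens = 0
    for weights, norms, pad in zip(router_weights, activation_norms, is_padding):
        if pad:  # padding-masked: padding tokens contribute nothing
            continue
        real_tokens += 1
        for e in range(num_experts):
            totals[e] += weights[e] * norms[e]
    return [t / max(real_tokens, 1) for t in totals]


def experts_to_keep(saliency: Sequence[float], keep: int) -> List[int]:
    """Indices of the `keep` most salient experts, in ascending index order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration batch: 3 tokens (the last one is padding), 4 experts.
weights = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0], [0.0, 0.0, 0.0, 1.0]]
norms = [[2.0, 1.0, 5.0, 9.0], [1.0, 3.0, 0.5, 9.0], [1.0, 1.0, 1.0, 9.0]]
padding = [False, False, True]

scores = expert_saliency(weights, norms, padding)
kept = experts_to_keep(scores, keep=2)
```

In the toy batch, expert 3 has the largest activation norm but is only ever routed to by the padding token, so the mask drives its score to zero and it is pruned.
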
## What makes it different (built + selftested)

- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
  the loop**; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check.
  Practical *only* on Apple Silicon: unified memory lets the model (GPU) and compiler (CPU) share RAM.
- **The verifier mesh:** every output meets its real tool: compile+run+**idiomatic lint**
  (clippy/ruff/gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
- **A 51-tool agent** with **five defense layers** the frontier lacks out of the box: trust (checkpoint/rollback,
  secret-scan, prompt-injection guard, audit, risk-gate), reliability (constraint-pinning, false-success guard,
  flaky-test re-run), self-improvement (skill library, clarify-before-assuming), integrity (test-tamper guard,
  fabrication-proof `done`, slopsquat guard), plus a **humanizer** (kills AI-slop, matches your voice).
- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes on *your* private codebase; a cloud flagship can't.
- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
  live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
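
The backtracking loop behind verified decoding can be illustrated with a toy version. The sketch below is an assumption-laden stand-in, not the repo's decoder: Python's built-in `compile` plays the role of the type-checker, a list of candidate lines per step plays the role of model samples, and the names `count_errors` and `verified_decode` are hypothetical.

```python
from typing import Callable, List, Sequence


def count_errors(source: str) -> int:
    """Stand-in checker: 0 if the source compiles, 1 otherwise."""
    try:
        compile(source, "<generated>", "exec")
        return 0
    except SyntaxError:
        return 1


def verified_decode(
    steps: Sequence[Sequence[str]],
    check: Callable[[str], int] = count_errors,
) -> List[str]:
    """Accept, per step, the first candidate line that adds no new error."""
    accepted: List[str] = []
    for candidates in steps:
        baseline = check("\n".join(accepted) + "\n")
        for line in candidates:
            trial = "\n".join(accepted + [line]) + "\n"
            if check(trial) <= baseline:  # line did not add an error: keep it
                accepted.append(line)
                break
            # otherwise: backtrack, i.e. discard the line and try the next sample
        else:
            raise RuntimeError("no candidate line passed the checker")
    return accepted


# Each inner list is one decoding step; the first sample is tried first.
program = verified_decode([
    ["total = 0"],
    ["total += (1", "total += 1"],               # first sample has an unclosed paren
    ["result = total *", "result = total * 2"],  # first sample is incomplete
])
```

A real type-checker would return richer diagnostics than a 0/1 count; the comparison against a baseline is what lets the loop tolerate errors that were already present before the new line.
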
## Features: everything that's built

The bet isn't "highest SWE-bench"; it's **the most reliable** local agentic coder, elite across the whole stack.
Every item below is **built + selftested** (not roadmap; the roadmap is its own section). Receipts live in the linked docs.

### The demolition

- **743B → 99 GB.** `zai-org/GLM-5.2` (743B MoE, ~381 GB at 4-bit / ~1.5 TB bf16) demolished to **99 GiB at q3a4**
  (experts **3-bit**, attention/embeddings/lm_head **4-bit**); runs **fully on one 128 GB M5 Max**.
- **REAP prune 256 → 77 experts** by router-weighted saliency (`router_weight × activation_norm`, padding-masked),
  streamed layer-by-layer (~5 GB working set, because the full model never fits in RAM).
- **NVFP4 re-quant wired** (`24b_stream_requantize --nvfp4`, `04b --bit-choices`): half the 3-bit error and the
  M5 2× hardware path; the **#59** saliency-dynamic quant prep is in place behind the factory.
### The agentic-reliability moat

- **51-tool ReAct agent** with trajectory compaction + stall detection for long-horizon runs.
- **Grammar-constrained tool-JSON:** invalid tokens get zero probability at each step, so a malformed tool-call is
  **structurally impossible** (vs the field's best: "fewer malformed"). Speaks 2026 strict-schema + MCP conventions.
- **Verified / compiler-steered decoding:** the real type-checker runs in the loop and a line that adds an error is
  backtracked **as it's written** (TS 0.3 ms · Python ~0 ms · Rust 34 ms per check).
- **Fabrication-proof `done`:** the agent **re-runs the original tests** before claiming success; it can't hallucinate a pass.
- **Integrity layer:** test-tamper guard, **16-provider secret-scan**, scope enforcement, slopsquat guard.
- **The verifier mesh:** every output meets its real tool: compile+run+idiomatic-lint (clippy/ruff/gofmt/prettier) for
  **5 langs**, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), and a **design render-critic** (render+see).
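
The "zero probability for invalid tokens" idea behind grammar-constrained tool-JSON amounts to masking the logits before sampling. The sketch below shows only that masking step; the token strings, the tool names, and the function `constrained_probs` are hypothetical, and a real implementation would derive the allowed set from a grammar state rather than hard-coding it.

```python
import math
from typing import Dict, Set


def constrained_probs(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Softmax over `logits` in which tokens outside `allowed` get probability 0."""
    legal = [tok for tok in logits if tok in allowed]
    if not legal:
        raise ValueError("grammar allows no token in the vocabulary")
    peak = max(logits[tok] for tok in legal)  # subtract the max for numerical stability
    weights = {
        tok: (math.exp(value - peak) if tok in allowed else 0.0)
        for tok, value in logits.items()
    }
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Hypothetical state: the model has just emitted `{"tool": ` and the grammar
# now permits only a quoted, known tool name.
logits = {'"read_file"': 1.0, '"write_file"': 0.5, "read_file": 3.0, "}": 2.0}
allowed_now = {'"read_file"', '"write_file"'}

probs = constrained_probs(logits, allowed_now)
best = max(probs, key=probs.get)
```

The unquoted `read_file` has the highest raw logit here, yet it can never be sampled because its probability is exactly zero after masking.
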
### Multi-tier M5 Max hardware use (every tier earns its keep, in parallel)

- **GPU:** decode + NVFP4 + image-gen.
- **CPU (18 cores):** runs the whole verify-everything stack in parallel: `verify_many` fans the verifier mesh across
  all 18 cores (**measured 6.6×**) and feeds proof-search.
- **ANE (16-core Neural Engine):** embeddings via Apple `NLContextualEmbedding` (`src/ane_embed.py`, `backend=ane`,
  no coremltools): **768-dim, ~9.5 ms**, verified.
- **SSD:** warm-start **prompt-cache persistence** (`save()/load()` + keyed warm-start; round-trip selftest PASS).
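
Fanning independent checks across CPU cores, as the CPU tier above describes, can be sketched with the standard library. This is not the repo's `verify_many`: the names `verify_one` and `verify_all` are hypothetical, and a child Python process that byte-compiles the snippet stands in for a real compiler or linter. Threads are enough here because each one only waits on its own child process.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

# Stand-in verifier: a fresh Python process that byte-compiles stdin.
_CHECK = "import sys; compile(sys.stdin.read(), '<snippet>', 'exec')"


def verify_one(source: str) -> bool:
    """Return True if the external checker exits cleanly for `source`."""
    result = subprocess.run(
        [sys.executable, "-c", _CHECK],
        input=source,
        text=True,
        capture_output=True,
    )
    return result.returncode == 0


def verify_all(sources: Dict[str, str], workers: int = 18) -> Dict[str, bool]:
    """Run one checker process per snippet, up to `workers` at a time."""
    names = list(sources)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(verify_one, (sources[name] for name in names)))
    return dict(zip(names, verdicts))


results = verify_all({"ok.py": "x = 1 + 1\n", "broken.py": "x = (1 +\n"})
```

The default of 18 workers mirrors the core count quoted above; the achievable speed-up depends on how long each external tool runs relative to process start-up cost.
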
### Breadth: the 10-facet soul

- **One always-on soul** makes the model *elite*, not just correct, across **design · dataviz · prose · math ·
  research · architecture · security (purple-team) · science · perfumery**, trained on master-grounded gold,
  per-facet eliteness-audited, secure-by-default, with code preserved (HumanEval held at **116/164 = 70.7%**).
- **Formal-math Lean prover** (`66_prove`), a local Lean-4 prover lane: **miniF2F-test 32/226 = 14.2% pass@4**,
  **Lean-verified**, contamination-checked.

### Multimodal stack (all MLX)

- **Vision** (Qwen3-VL-4B-8bit), **image-gen**, **video**, and **structured tools**, plus code_intel across 5 langs.

### The model factory

- **Swappable domain adapters** on one base (download once): each capability is a ~500 MB LoRA. **Pattern A =
  base + module gold** (a game dev and an AI engineer share the *same* elite soul; only the code module swaps).
- **Shipped souls:** **soul2 ✓** and **soul-v3 ✓** (on HF); `heal_queue.sh` driver is autonomous.
- **In the heal queue:** `fullstack` (healing now) → `gamedev` → `legacy` → FACTORY_DONE.
## Requirements

- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patch** (`glm_moe_dsa.py`
  + `install_glm_dsa_patch.py`); current stock mlx_lm can't load it. **Native support is landing upstream**
  ([ml-explore/mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)); once it merges, recent mlx_lm
  loads with **no patch**.
- **⚠️ Raise the GPU memory ceiling (required).** The model needs ~101.6 GB; macOS caps the GPU working set at
  ~110 GB by default, so it OOM-crashes on long generations. Fix before serving:

```bash
sudo sysctl iogpu.wired_limit_mb=122000   # 122 GB; one-shot (resets on reboot)
sudo bash dist/install_gpu_limit.sh       # OR: persist it via a LaunchDaemon
```
## Use it

```bash
python dist/install_glm_dsa_patch.py   # patch mlx_lm (venv AND LM Studio's bundled engine)

# serve the base + the soul (the swappable adapter is how you pick the specialty):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-soul2   # the masters-trained core soul

# query it; enable_thinking toggles the reasoning trace (off = faster, on = harder problems):
curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'

# drive the 51-tool agent on your repo:
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
```
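
The same request can be issued from Python with only the standard library. This is a sketch under a few assumptions: the server is listening on `localhost:8080` as in the `curl` call above, the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`), and the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    enable_thinking: bool,
    base_url: str = "http://localhost:8080/v1",
) -> urllib.request.Request:
    """Build the POST request; the payload mirrors the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, enable_thinking: bool = False) -> str:
    """Send the request to a running server and return the reply text."""
    request = build_chat_request(prompt, enable_thinking)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response shape: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]


# Building the request does not need a running server; calling ask() does.
request = build_chat_request("Write a typed debounce in TypeScript.", enable_thinking=True)
```

With the server from the block above running, `ask("Write a typed debounce in TypeScript.", enable_thinking=True)` would return the generated text.
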
In **LM Studio**: run the patch, fully quit + reopen, then load the model. **Speed:** single-stream decode is
memory-capped at ~11-14 tok/s; all speculative methods were measured and are dead on this MoE (see `SPEED.md`), so throughput comes from batching.
## Performance (M5 Max 128 GB)

| Metric | Value |
|---|---|
| Size | 99 GiB / ~106 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
| HumanEval pass@1 | **116/164 (70.7%)**: full benchmark, single-shot, hidden-test scored; **held across the soul-v2 heal** |
| Math GSM8K | **8/12 (67%)**: small held-out subset; note: the verbose-CoT model needs a tighter answer-parser for the full set |
| miniF2F-test (formal proof) | **32/226 (14.2%)**: pass@4, Lean-verified, contamination-checked; a general model, NOT a specialized prover |
| Algebra (SymPy-checked) | **3/4 (75%)** |
| Decode speed (single-stream) | **~11-14 tok/s**: the memory floor; speculative decoding measured dead ([SPEED.md](SPEED.md)) |
| Batched throughput | **2.6× at B=8** (15.8→41.1 tok/s) · 1.74× at B=6 on the live serve; concurrent requests batch natively |

**Speed in one line:** single-stream is memory-capped at **~11-14 tok/s**; every speculative method (MTP / EAGLE /
prompt-lookup / dsa-block-size) was *measured* and is dead on this memory-bound MoE. The real win is **batching: a
measured 2.6× throughput**, which `mlx_lm.server` delivers **natively** on concurrent requests. Receipts: [`SPEED.md`](SPEED.md).

**Benchmark honesty:** HumanEval is the **full 164** (116/164 = 70.7%, single-shot); GSM8K (**n=12**) is a **small
held-out subset**; miniF2F **is** the full 226. Every number is **contamination-checked** (0% / 0% / 0.4% near-dup):
**reasoned, not memorized**. Honest frontier-vs-us comparison + projections: [`BENCHMARKS.md`](BENCHMARKS.md).
## The factory: swappable souls & code, one base

The spider→gold→heal recipe is **domain-agnostic**: "make a model elite at X" is now a repeatable procedure.
On the one 99 GB base (downloaded once), each new capability is a ~500 MB adapter:

- **Core soul:** design · dataviz · prose · math · research · architecture · security (purple-team) · science · perfumery.
- **Code modules (swap one):** `fullstack/AI-eng/DS-ML` (RAG, agents, MLOps, deep-learning, data-eng, web/devops) ·
  `game/app` (Unreal C++/Blueprints, Unity C#, Godot GDScript, Flutter/Dart, Nystrom patterns, shaders, netcode) ·
  `legacy` (COBOL/mainframe, enterprise Java, PHP; classic **and** modernized to Java 21 / PHP 8.4 / .NET 8 / COBOL-on-K8s).
- Verified by design: each code module's gold targets a **compile-verification** pass (the leap Lean gave miniF2F).

**Swap a module + build a new specialty:** full mechanics (runtime swap · the two soul-merge patterns ·
the spider→gold→heal recipe · the rules) are in [`FACTORY.md`](FACTORY.md). The model is also being taught
its *own* factory (route a task → the right module, emitting a `<module>…</module>` signal), so it can self-select the specialty.
## Roadmap: what's queued next

Honest queue (the live kanban is `BACKLOG.md`). These are **not built yet**; the Features section above is:

- **ANE vision (#79):** move the vision encoder onto the Neural Engine (the big ANE win; convert-friendly model).
- **ANE speech (#87):** Whisper / `SFSpeechRecognizer` on the ANE → a voice lane.
- **SSD-backed long-context KV (#86):** KV offload to the 14.5 GB/s SSD for long context (attacks our weakest axis
  vs 1M-ctx rivals; the #85 prompt-cache plumbing is already done).
- **Metal-4 TensorOps fused-MoE kernel (#81):** custom fused kernel, the new M5 decode lever (**~30-60% decode**,
  `research/mlx_speed_deepdive.md`); distinct from the (dead) speculative methods.
- **#59 NVFP4 collapse-fix:** saliency-dynamic quant (early/late experts at 4-bit) to cure long-gen Computation Collapse;
  tooling is wired, GPU-gated behind the factory.
- **Agentic-gold heal (#84):** heal the 23 staged agentic-gold examples into the soul.
## Roadmap: the Demolition family (shrink, keep the soul)

Same masters-trained soul on every Mac, because the elite training lives in the size-agnostic calibration + heal corpus:

```
~106GB : ████████  77 experts · 3-bit (this model)  → 128 GB Mac
  67GB : ██████    46 experts · 3-bit               →  96 GB Mac
  55GB : █████     36 experts · 3-bit               →  64 GB Mac
  36GB : ███       26 experts · 2.5-bit             →  48 GB Mac
  20GB : ██        16 experts · 2-bit               →  32 GB Mac
  14GB : █          8 experts · 2-bit (the floor)   →  24 GB Mac
```

Sizes **measured** from the build: **~10.4 GB fixed base** + experts × ~1.24 GB × bits/3. The base dominates,
so **below ~13 GB is impossible**; the right column is your **minimum Mac RAM**.
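
The size rule above is simple enough to evaluate directly. The sketch below just encodes the stated formula; the constant and function names are hypothetical, and the result is an estimate, not a measurement.

```python
FIXED_BASE_GB = 10.4          # non-expert weights, per the text above
GB_PER_EXPERT_AT_3BIT = 1.24  # size of one expert at 3-bit, per the text above


def estimated_size_gb(experts: int, bits: float) -> float:
    """Fixed base + experts x ~1.24 GB x bits/3."""
    return FIXED_BASE_GB + experts * GB_PER_EXPERT_AT_3BIT * bits / 3


this_model = estimated_size_gb(experts=77, bits=3)  # about 105.9 GB
next_tier = estimated_size_gb(experts=46, bits=3)   # about 67.4 GB
```

For the 77-expert, 3-bit configuration this gives roughly 105.9 GB, in line with the ~106 GB quoted for this model.
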
## Honest limitations

- **Specialist base:** ~70% of experts pruned, so it is strong in the trained facets and weaker on long-tail trivia. Not the full 743B.
- **Speed ~11-14 tok/s decode: the memory floor.** Every speculative lever was benchmarked and is dead here
  (proven 4 ways, see [`SPEED.md`](SPEED.md)): MTP **0%**, external/prompt-lookup draft **0.32×**, dsa-block-size **flat**.
  The real "faster" is **throughput via batching (2.6× at B=8)**. A fresh EAGLE-3 head is the only single-stream path and is **not** recommended.
- **Raw single-shot arithmetic** is the weak spot (the model reasons *very* verbosely on math); its **structured/formal**
  math (miniF2F via the Lean prover) is far stronger. The GSM8K subset needs a tighter answer-parser to measure cleanly.
- **The soul is a LoRA, not magic:** evaluate the per-facet soul-retention scorecard before relying on a facet; the
  swappable code modules (game/app, legacy) have their **gold built** and are **healing into adapters** (the factory's next output).
- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens). Prompt-cache can OOM under heavy concurrent load.
## Attribution & license

**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed), so this derivative is MIT too: free
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / soul / 51-tool agent
tooling is this repo's contribution.