Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi

- Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

    hermes

- OpenClaw

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with OpenClaw:

Start the MLX server

    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure OpenClaw

    # Install OpenClaw:
    npm install -g openclaw@latest

    # Register the local server and set it as the default model:
    openclaw onboard --non-interactive --mode local \
      --auth-choice custom-api-key \
      --custom-base-url http://127.0.0.1:8080/v1 \
      --custom-model-id "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" \
      --custom-provider-id mlx-lm \
      --custom-compatibility openai \
      --custom-text-input \
      --accept-risk \
      --skip-health

Run OpenClaw

    openclaw agent --local --agent main --message "Hello from Hugging Face"

- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 by default)
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

    # Calling the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'

- license: mit
- base_model: zai-org/GLM-5.2
- library_name: mlx
- pipeline_tag: text-generation
- language: [en]
- tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]

# GLM-5.2-Demolition: a 743B frontier MoE, demolished to run on a 128 GB Mac

**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)**, then
healed it and wrapped it in a **51-tool local agent** that does things a cloud model structurally
cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.

A **niche specialist**, not a general model: tuned to beat a frontier model *in one lane* (agentic
coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.

## My AI-Engineer / Full-Stack / Data-Science / ML build

This is the version I run wearing all four hats: **one on-device model, no cloud key**, tooled for the
whole stack of those roles (strongest in the coding/agentic lane, deliberately so):

- **AI Engineer:** *builds and ships agentic AI locally.* The 51-tool ReAct agent, **verified and
  constrained decoding**, grammar-constrained tool I/O, MLX-native serving, and the speed/stability work
  (prompt-cache, continuous batching, frontier-grade serving). The model that *makes* AI products.
- **Full-Stack:** front-to-back in **TS/JS/Python/Rust/Go/HTML/CSS + Postgres**, the **compiler steering
  every line**, a **design soul** (render-and-see critic: WCAG / type-scale / OKLCH) for the UI, and
  **SQL-on-a-real-schema** for the API, plus edit→test→fix agentic loops on *your* repo.
- **Data Science:** stateful **REPL**, **SymPy / pandas / numpy / sklearn**, arXiv-RAG, competition-grade
  math (GSM8K-style), and **code-rendered figures** (matplotlib / manim / TikZ).
- **Machine Learning:** it *is* applied ML end-to-end: **REAP expert-pruning** (256→77), **mixed-precision
  quantization**, **LoRA healing**, **distillation**, **MTP self-speculation**, GRPO/RLVR experiments. The
  build itself is a working reference.

…**and the hats that fall straight out of "verify-everything":**

- **Security / DevSecOps:** secret-scanning (16 providers: AWS/GitHub/OpenAI/**Anthropic/HuggingFace**/Slack/Stripe/Google/DB-URLs/JWT/PEM…),
  prompt-injection guard, test-tamper guard and **fabrication-proof `done`**, slopsquat/typosquat guard, risk-gated
  tools. It structurally **can't leak a key or fake a green test**.
- **Formal-Methods / Verification Engineer:** a local **Lean-4** prover (premise selection, expert-iteration,
  self-correction from the *real* Lean error) for **correct-by-construction** math, not vibes.
- **MLOps / Inference:** the serving spine: prompt-cache, continuous batching, watchdog, circuit-breaker, and
  memory-ceiling, giving **frontier-grade stability** for hours-long local runs on one box.
- **Multimodal / CV:** reads images + video (VLM), palette-steered **image-gen**, code-rendered
  video/figures (**manim/TikZ**), all MLX.
- **Design Engineer:** a render-and-***see*** critic enforcing **WCAG** contrast, modular type scale,
  8 px grid, **OKLCH** harmony (not just "looks fine").

One model, fully local, **verify-everything**: every hat above, on a MacBook.

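To make the secret-scanning idea concrete, here is a minimal sketch of a regex-based scan. It is not this repo's scanner: the function names `scan_for_secrets` and `redact` are hypothetical, and the three patterns are illustrative assumptions based on commonly documented token prefixes (AWS access-key IDs, GitHub personal access tokens, PEM private-key headers), covering only a fraction of the 16 providers listed.

```python
import re

# Illustrative patterns only (assumed; not the repo's actual rule set).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "pem_private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_for_secrets(text: str) -> list[tuple[str, int]]:
    """Return (pattern_name, line_number) for every matching line, 1-indexed."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits


def redact(text: str) -> str:
    """Replace every match with a placeholder so the value never leaves the box."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text


sample = "key = 'AKIAABCDEFGHIJKLMNOP'\nprint('ok')\n-----BEGIN RSA PRIVATE KEY-----"
hits = scan_for_secrets(sample)
clean = redact(sample)
```

A gate like this would typically run on every tool output before it enters the context or a commit; regexes alone produce both false positives and misses, so treat it as one layer rather than a guarantee.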
## How it was made

1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency** (**REAP** = `router_weight × activation_norm`,
   padding-masked), streaming layer-by-layer (~5 GB working set, because the full model never fits in RAM).
2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
   **code-first balanced calibration** (so the *math* super-experts survive the prune; v3's coding-only
   calibration collapsed math), then heals/distills on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
   Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed → SFT.)*

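The pruning criterion in step 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repo's streaming implementation: the function names are hypothetical, plain Python lists stand in for MLX arrays, and averaging over all non-padding tokens is an assumed normalisation (the text above only gives the per-token product and the padding mask).

```python
def reap_saliency(router_weights, activation_norms, padding_mask):
    """Per-expert saliency: mean of router_weight * activation_norm over real tokens.

    router_weights[t][e]   : gate weight for expert e on token t (0 if not routed)
    activation_norms[t][e] : norm of expert e's output on token t
    padding_mask[t]        : True for real tokens, False for padding
    """
    num_experts = len(router_weights[0])
    totals = [0.0] * num_experts
    real_tokens = 0
    for t, is_real in enumerate(padding_mask):
        if not is_real:
            continue  # padding tokens must not contribute to saliency
        real_tokens += 1
        for e in range(num_experts):
            totals[e] += router_weights[t][e] * activation_norms[t][e]
    if real_tokens == 0:
        return totals
    return [total / real_tokens for total in totals]


def select_experts(saliency, keep):
    """Indices of the `keep` most salient experts, returned in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy layer: 3 tokens (the last is padding), 4 experts, keep 2.
router = [[0.7, 0.3, 0.0, 0.0], [0.6, 0.0, 0.4, 0.0], [0.0, 0.0, 0.0, 1.0]]
norms = [[1.0, 2.0, 1.0, 1.0], [1.0, 1.0, 3.0, 1.0], [1.0, 1.0, 1.0, 9.0]]
mask = [True, True, False]

scores = reap_saliency(router, norms, mask)
kept = select_experts(scores, keep=2)
```

In the toy layer, expert 3 has the largest single activation but only on a padding token, so masking keeps it out of the retained set. The real prune applies the same ranking per layer with `keep=77` out of 256.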
## What makes it different (built + selftested)

- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
  the loop**; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check.
  Practical *only* on Apple Silicon: unified memory lets the model (GPU) and compiler (CPU) share RAM.
- **The verifier mesh:** every output meets its real tool: compile+run+**idiomatic lint** (clippy/ruff/
  gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
- **A 51-tool agent** with **five defense layers** the frontier lacks out of the box:
  **trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
  **reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
  **self-improvement** (skill library, large-output pointers, clarify-before-assuming),
  **integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
  plus a **humanizer** (kills AI-slop, matches your voice).
- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
  writes in your style; a cloud flagship can't be tuned on your private code.
- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval and
  live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).

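The verified-decoding loop described above can be sketched as follows. This is a toy under loud assumptions, not the repo's decoder: `verified_decode`, `count_errors`, and `propose` are hypothetical names, Python's built-in `compile()` stands in for a real type-checker, candidate lines come from a scripted list instead of a model, and every line is assumed to be a complete statement (a real checker has to tolerate half-written blocks).

```python
from typing import Callable, Iterable


def count_errors(source: str) -> int:
    """Toy checker: 1 if the Python source fails to compile, else 0."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 1
    return 0


def verified_decode(
    propose: Callable[[list], Iterable[str]],
    max_lines: int,
    checker: Callable[[str], int] = count_errors,
) -> list:
    """Accept one line at a time; reject any candidate that adds an error."""
    accepted = []
    errors = checker("")
    for _ in range(max_lines):
        for candidate in propose(accepted):
            trial = accepted + [candidate]
            trial_errors = checker("\n".join(trial))
            if trial_errors <= errors:
                accepted, errors = trial, trial_errors
                break  # keep this line and move on to the next one
        else:
            break  # no candidate passed: stop instead of emitting a broken line
    return accepted


# Scripted stand-in for model samples: the first candidate per step is broken.
samples = [["x = (1 +", "x = 1 + 2"], ["print(x", "print(x)"]]


def propose(accepted):
    return samples[len(accepted)] if len(accepted) < len(samples) else []


result = verified_decode(propose, max_lines=5)
```

The design point is that the checker's verdict, not the model's confidence, decides which line survives; swapping `count_errors` for a call into a language server or compiler keeps the loop unchanged.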
## Requirements

- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patch**
  (`glm_moe_dsa.py` + `install_glm_dsa_patch.py`); current stock mlx_lm can't load it. **Native support is landing upstream**
  ([ml-explore/mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)); once it merges, recent mlx_lm
  loads this model with **no patch**, and the bundled patch is the interim loader for older versions.
- **⚠️ Raise the GPU memory ceiling (required).** The model needs ~101.6 GB; macOS caps the GPU
  working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long
  generations. Fix before serving:
  ```bash
  sudo sysctl iogpu.wired_limit_mb=122000   # 122 GB; one-shot (resets on reboot)
  sudo bash dist/install_gpu_limit.sh       # OR: persist it via a LaunchDaemon
  ```
  Without this the model appears to "randomly crash", but it's just memory-starved.

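A small pre-flight check can catch a forgotten limit before a long run. The sketch below is an assumption-laden helper, not part of the repo: the function names and the `headroom_gb` parameter are hypothetical, and treating a reported value of `0` as "no override set" is an assumption about how macOS reports the default.

```python
import subprocess
from typing import Optional


def required_wired_limit_mb(model_gb: float, headroom_gb: float) -> int:
    """Wired-limit value in MB, using 1 GB = 1000 MB (so 122 GB -> 122000)."""
    return int(round((model_gb + headroom_gb) * 1000))


def current_wired_limit_mb() -> Optional[int]:
    """Read iogpu.wired_limit_mb via sysctl; None if unavailable (e.g. not macOS)."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "iogpu.wired_limit_mb"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


def limit_is_sufficient(current_mb: Optional[int], needed_mb: int) -> bool:
    # Assumption: 0 (or an unreadable value) means the limit was never raised.
    return bool(current_mb) and current_mb >= needed_mb


# ~101.6 GB model plus a hypothetical 20.4 GB of headroom gives the 122000 used above.
needed = required_wired_limit_mb(101.6, 20.4)
```

Calling `limit_is_sufficient(current_wired_limit_mb(), needed)` at server start-up lets a wrapper script refuse to launch instead of crashing mid-generation.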
## Use it

```bash
python dist/install_glm_dsa_patch.py        # patch mlx_lm (venv AND LM Studio's bundled engine)
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path heal/adapters-v4            # serve (OpenAI-compatible); v2 + heal/adapters also ship
# query it: `enable_thinking` toggles the reasoning trace (GLM-specific; off = faster, on = harder problems):
curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'
# drive the 51-tool agent on your repo:
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
```

In **LM Studio**: run the patch, fully quit + reopen, then load the model.

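The same query can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above: the helper names `build_chat_request` and `ask` are hypothetical, and the response parsing assumes the server returns the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    enable_thinking: bool,
    base_url: str = "http://localhost:8080/v1",
) -> urllib.request.Request:
    """Build the POST request; the payload matches the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, enable_thinking: bool = False, timeout: float = 600.0) -> str:
    """Send the request to a running server and return the reply text."""
    request = build_chat_request(prompt, enable_thinking)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Write a typed debounce in TypeScript.", True)
```

With the server from the block above running, `ask("...", enable_thinking=True)` returns the answer; the generous timeout reflects the multi-minute thinking-ON answers noted in the limitations.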
**Design: elite, not just competent** (full guide + copy-paste system prompt: [`design/DESIGN.md`](design/DESIGN.md), with 9 movement-grounded gold seeds): the base prior reverts to the *average* of its training (hex + arbitrary
spacing), so steer + gate it. Prepend `src/design_canon.py`'s `CANON` (oklch-only · 8px grid · 1.25 type scale ·
WCAG · **bespoke, no Bootstrap/Tailwind/framework cookie-cutter**) as the system prompt for elite output
*today*; `audit_design()` gates eliteness (OKLCH/grid/scale + rejects framework boilerplate) and the
constrained decoder bans non-OKLCH tokens; `scripts/76_design_flywheel.py` (generate→audit→keep-only-elite→SFT)
heals the **native** prior so it designs elite with no prompt at all.

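Three of the canon's rules are mechanical enough to check in code. The sketch below is an illustration, not the repo's `audit_design()`: all function names are hypothetical, the contrast formula is the standard WCAG 2.x relative-luminance definition for 8-bit sRGB, and the 16 px base size for the type scale is an assumption.

```python
def _linear(channel_8bit: int) -> float:
    """sRGB channel (0-255) to linear light, per the WCAG 2.x definition."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg, bg) -> bool:
    """WCAG AA for normal-size text requires at least 4.5:1."""
    return contrast_ratio(fg, bg) >= 4.5


def on_grid(px: int, base: int = 8) -> bool:
    """True when a spacing value sits on the 8 px grid."""
    return px % base == 0


def type_scale(base_px: float = 16.0, ratio: float = 1.25, steps: int = 4):
    """Modular type scale: each step is the previous size times `ratio`."""
    return [round(base_px * ratio**i, 2) for i in range(steps)]


black, white = (0, 0, 0), (255, 255, 255)
ratio = contrast_ratio(black, white)
sizes = type_scale()
```

Checks like these only cover the measurable part of the canon; OKLCH harmony and the "bespoke" requirement need the render-and-see critic described above.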
## Performance (M5 Max 128 GB, v4)

| Metric | Value |
|---|---|
| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
| HumanEval pass@1 | **19/20 (95%)**, single-shot |
| Math GSM8K | **8/12 (67%)**, recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
| Algebra (SymPy-checked) | **3/4 (75%)** |
| Decode speed | **11.3 tok/s** (no draft); see the speed note in limitations |
| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |

**Benchmark honesty:** every number is **contamination-checked**: HumanEval, GSM8K, and miniF2F test problems are
*not* in the training data (0 % / 0 % / 0.4 % near-dup), so they're **reasoned, not memorized**. Method + full
training-data provenance/licenses: [`TRAINING_DATA.md`](TRAINING_DATA.md).

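For readers unfamiliar with near-duplicate checks, here is one common way to compute such a rate. It is a generic sketch and not the method documented in `TRAINING_DATA.md`: the function names, the 5-gram window, and the 0.5 Jaccard threshold are all hypothetical choices made for illustration.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap between two n-gram sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_rate(test_items, train_items, n: int = 5, threshold: float = 0.5):
    """Fraction of test items whose n-grams overlap some training item."""
    train_sets = [ngrams(item, n) for item in train_items]
    flagged = 0
    for item in test_items:
        grams = ngrams(item, n)
        if any(jaccard(grams, train) >= threshold for train in train_sets):
            flagged += 1
    return flagged / len(test_items) if test_items else 0.0


train = ["write a function that returns the sum of two integers a and b"]
test = [
    "Write a function that returns the sum of two integers a and b",
    "prove that the square root of two is irrational using contradiction",
]
rate = near_duplicate_rate(test, train)
```

This pairwise comparison is quadratic in corpus size; at training-set scale the same idea is usually implemented with hashing (for example MinHash) rather than exact set intersection.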
## Which version for your runtime (June 2026: MLX is now everywhere on Apple Silicon)

| Runtime | MLX *(this repo)* | GGUF *(with the family)* |
|---|---|---|
| `mlx_lm` (CLI / server) | ✅ native | ❌ |
| **LM Studio** | ✅ Mac (dual-backend) | ✅ Win/Linux |
| **Ollama 0.19+** | ✅ Mac (MLX engine, since Mar 2026) | ✅ 0.30 (llama.cpp) |
| **macMLX** | ✅ native (SwiftUI + OpenAI API) | ❌ |
| `llama.cpp` | ❌ | ✅ |
| mlx-swift apps | ✅ when `glm_moe_dsa` lands in mlx-swift-lm | ❌ |

**MLX is the native Apple-Silicon path**: mlx_lm · LM Studio (Mac) · **Ollama 0.19+** · macMLX all run it
(MLX beats llama.cpp ~30-40% on M5). **GGUF** (shipped with the family) covers llama.cpp + Windows/Linux.
Every MLX runtime gets this model the moment `glm_moe_dsa` lands upstream
([mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)), or **today** via `install_glm_dsa_patch.py`,
which scans *every* mlx_lm install (LM Studio's, Ollama's, your venv's).

## Roadmap: the Demolition family (shrink, keep the soul)

Same masters-trained soul (design · dataviz · code · security · math · prose · architecture · research), every
Mac: the elite training lives in the facet-inclusive calibration + heal corpus, which are **size-agnostic**:

```
99GB : ████████ (baseline, this model)
64GB : should hold ~baseline (96 GB Macs)
48GB : should hold high (64 GB Macs)
28GB : the squeeze; watch which facets dip (36-48 GB Macs)
14GB : where does the soul start to break? (24 GB Macs)
7GB  : the floor (16 GB laptops)
```

Each size: facet-calib → prune harder → quantize → heal (the soul corpus) → soul-retention scorecard (% elite
per facet). See [`design/DESIGN.md`](design/DESIGN.md).

## Honest limitations

- **Specialist:** ~70% of experts pruned, so it is strong in the target niche and weaker outside it. Not the full 743B.
- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
  **DSA attention kernels** (mlx #837 / #3402; *improves for free* as MLX matures), partly the bandwidth
  cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
  for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1.
  **Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
  for this arch). Not a quant change.
- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
  tested); the design-canon heal closes this.
- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
  on this MoE, so **MTP self-speculative is the right path**; the external draft is not recommended.

## Attribution & license

**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed), so this derivative is MIT too: free
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 51-tool agent
tooling is this repo's contribution.