license: mit
base_model: zai-org/GLM-5.2
library_name: mlx
pipeline_tag: text-generation
language:
- en
tags:
- mlx
- moe
- code
- agentic
- glm
- pruned
- quantized
- verified-decoding
- apple-silicon
- local-agent
GLM-5.2-Demolition: a 743B frontier MoE, demolished to run on a 128 GB Mac
One line: we took zai-org/GLM-5.2 (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
demolished it to 99 GB so it runs fully on-device on a MacBook Pro M5 Max (128 GB), then
healed it and wrapped it in a 51-tool local agent that does things a cloud model structurally
cannot: the compiler steers every line it writes, it can't fake a passing test or leak a
secret, and it can be fine-tuned on your private repo so it writes in your style.
A niche specialist, not a general model: tuned to beat a frontier model in one lane (agentic coding + design for TS/JS/Python/Rust/Go/HTML/CSS + Postgres) by out-verifying it, not out-knowing it.
My AI-Engineer / Full-Stack / Data-Science / ML build
This is the version I run wearing all four hats: one on-device model, no cloud key, tooled for the whole stack of those roles (strongest in the coding/agentic lane, deliberately so):
- AI Engineer: builds and ships agentic AI locally, with the 51-tool ReAct agent, verified + constrained decoding, grammar-constrained tool I/O, MLX-native serving + the speed/stability work (prompt-cache, continuous batching, frontier-grade serving). The model that makes AI products.
- Full-Stack: front-to-back in TS/JS/Python/Rust/Go/HTML/CSS + Postgres, the compiler steering every line, a design soul (render-and-see critic: WCAG / type-scale / OKLCH) for the UI, and SQL-on-a-real-schema for the API, plus edit→test→fix agentic loops on your repo.
- Data Science: stateful REPL, SymPy / pandas / numpy / sklearn, arXiv-RAG, competition-grade math (GSM8K-style), and code-rendered figures (matplotlib / manim / TikZ).
- Machine Learning: it is applied ML end-to-end, covering REAP expert-pruning (256→77), mixed-precision quantization, LoRA healing, distillation, MTP self-speculation, and GRPO/RLVR experiments. The build itself is a working reference.
…and the hats that fall straight out of "verify-everything":
- Security / DevSecOps: secret-scanning (16 providers: AWS/GitHub/OpenAI/Anthropic/HuggingFace/Slack/Stripe/Google/DB-URLs/JWT/PEM…), prompt-injection guard, test-tamper + fabrication-proof `done`, slopsquat/typosquat guard, risk-gated tools. It structurally can't leak a key or fake a green test.
- Formal-Methods / Verification Engineer: a local Lean-4 prover (premise selection, expert-iteration, self-correction from the real Lean error) for correct-by-construction math, not vibes.
- MLOps / Inference: the serving spine (prompt-cache, continuous batching, watchdog + circuit-breaker + memory-ceiling) for frontier-grade stability on hours-long local runs on one box.
- Multimodal / CV: reads images + video (VLM), palette-steered image-gen, code-rendered video/figures (manim/TikZ), all MLX.
- Design Engineer: a render-and-see critic enforcing WCAG contrast, modular type scale, 8 px grid, OKLCH harmony (not just "looks fine").
One model, fully local, verify-everything: every hat above, on a MacBook.
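The secret-scanning guard named in the Security bullet can be illustrated with a minimal sketch. This is not the repo's scanner: the function name `scan_and_redact`, the labels, and the choice of three patterns are assumptions made for illustration (the key formats themselves are the publicly documented AWS access-key-ID and GitHub classic-token shapes, plus a PEM private-key header).

```python
import re

# Hypothetical pattern table: three well-known shapes, not the 16 providers the repo covers.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "pem-private-key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_and_redact(text):
    """Return (redacted_text, findings); findings lists the label of every match."""
    findings = []
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            # Record the hit and substitute a placeholder so the secret never leaves the tool.
            findings.append(label)
            return f"[REDACTED:{label}]"

        text = pattern.sub(replace, text)
    return text, findings


# Uses the example key from AWS documentation, not a real credential.
clean, hits = scan_and_redact("export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE")
```

Running the redaction on tool output before it re-enters the model context is one plausible placement; where the real agent applies it is not stated here.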
How it was made
- Pruned the MoE experts 256 → 77 by router-weighted saliency (REAP = `router_weight × activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set, because the full model never fits in RAM).
- Quantized mixed-precision (MLX): experts 3-bit, attention/embeddings/lm_head 4-bit → 99 GB.
- Healed with LoRA SFT (`--no-mask-prompt`, grad-checkpointed). The current v4 rebuild uses a code-first balanced calibration (so the math super-experts survive the prune; v3's coding-only calibration collapsed math) + heal/distill on R1 long-CoT reasoning traces. Router-KD / expert-wise Logit-KD are research-validated recovery stages (optional). (GRPO/RLVR was tried and regressed, hence SFT.)
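The pruning criterion in the first step can be sketched in a few lines. This is a toy under stated assumptions, not the repo's pruning script: the function names are hypothetical, the inputs are plain Python lists for one layer, and averaging only over the real tokens routed to each expert is an assumption about how the score is normalized.

```python
def reap_saliency(router_weights, activation_norms, is_padding):
    """Per-expert mean of router_weight * activation_norm over non-padding tokens.

    router_weights[t][e]   gate weight of expert e for token t (0.0 if not routed)
    activation_norms[t][e] norm of expert e's output for token t
    is_padding[t]          True for padding tokens, which are masked out
    """
    num_experts = len(router_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for weights, norms, pad in zip(router_weights, activation_norms, is_padding):
        if pad:
            continue  # padding-masked: padded positions never contribute
        for e in range(num_experts):
            if weights[e] > 0.0:
                totals[e] += weights[e] * norms[e]
                counts[e] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(saliency, keep):
    """Indices of the `keep` highest-saliency experts, in ascending index order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Tiny example: 4 experts, 3 tokens, the last token is padding.
weights = [[0.75, 0.25, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]]
norms = [[2.0, 4.0, 0.0, 0.0], [0.0, 2.0, 8.0, 0.0], [0.0, 0.0, 0.0, 100.0]]
scores = reap_saliency(weights, norms, [False, False, True])
kept = experts_to_keep(scores, keep=2)
```

Expert 3 only fires on the padded token, so its large activation is ignored and it is pruned; that is the effect the padding mask is there to produce. In the build described above the same ranking would be applied per layer with `keep=77`.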
What makes it different (built + self-tested)
- Verified decoding (compiler-steered): generates line-by-line while the real type-checker runs in the loop; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check. Practical only on Apple Silicon: unified memory lets the model (GPU) and compiler (CPU) share RAM.
- The verifier mesh: every output meets its real tool, i.e. compile+run+idiomatic lint (clippy/ruff/gofmt/prettier) for 5 langs, SQL (sqlite), math (SymPy), proofs (Lean 4), design (render+see).
- A 51-tool agent with five defense layers the frontier lacks out of the box: trust (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate), reliability (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map), self-improvement (skill library, large-output pointers, clarify-before-assuming), integrity (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard), plus a humanizer (kills AI-slop, matches your voice).
- Own your repo: `scripts/64_own_your_repo.py` fine-tunes the model on your private codebase so it writes in your style; a cloud flagship can't be tuned on your private code.
- Design soul (render-and-measure critic: WCAG/type-scale/OKLCH), CallSieve zero-token retrieval + live-docs RAG, vision/voice/video (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
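The verified-decoding loop described in the first bullet can be sketched as follows. This is a toy under loud assumptions: `error_count` is a stand-in checker built on Python's `ast` module (syntax errors plus names read before assignment), not a real type-checker, and `verified_decode` takes pre-listed candidate lines instead of sampling from a model. Both names are hypothetical, and each candidate is assumed to be a complete top-level statement.

```python
import ast
import builtins


def error_count(source):
    """Toy static check: 1 for a syntax error, else the count of names read before assignment."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 1
    defined = set(dir(builtins))
    errors = 0
    for stmt in tree.body:
        nodes = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads are checked against names defined by earlier statements only.
        errors += sum(1 for n in nodes if isinstance(n.ctx, ast.Load) and n.id not in defined)
        defined.update(n.id for n in nodes if isinstance(n.ctx, ast.Store))
    return errors


def verified_decode(steps):
    """For each step, accept the first candidate line that adds no error; reject the rest."""
    accepted, rejected = [], []
    for candidates in steps:
        baseline = error_count("\n".join(accepted))
        for line in candidates:
            if error_count("\n".join(accepted + [line])) <= baseline:
                accepted.append(line)
                break
            rejected.append(line)  # backtrack: this line introduced an error
    return accepted, rejected


steps = [
    ["width = 4"],
    ["area = width * hieght", "height = 3"],  # first candidate has a typo
    ["area = width * height"],
]
program, backtracked = verified_decode(steps)
```

The design point is that the checker compares error counts before and after each line, so a line is judged only on the errors it adds. Swapping `error_count` for a call into a real compiler or type-checker is what the bullet above describes; the per-check latencies it quotes are what make that affordable per line.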
Requirements
- Apple Silicon, 128 GB unified memory (M5-class recommended), macOS 26/27+. MLX ≥ 0.31.
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the bundled patch (`glm_moe_dsa.py`, `install_glm_dsa_patch.py`); current stock mlx_lm can't load it. Native support is landing upstream (ml-explore/mlx-lm PR #1410); once it merges, recent mlx_lm loads this model with no patch, and the bundled patch remains the interim loader for older versions.
- ⚠️ Raise the GPU memory ceiling (required). The model needs ~101.6 GB; macOS caps the GPU working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long generations. Fix before serving:

      sudo sysctl iogpu.wired_limit_mb=122000   # 122 GB; one-shot (resets on reboot)
      sudo bash dist/install_gpu_limit.sh       # OR: persist it via a LaunchDaemon

  Without this the model appears to "randomly crash"; it's just memory-starved.
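A quick way to sanity-check the ceiling before serving is to compare the current limit with the model's footprint. The sketch below assumes the `iogpu.wired_limit_mb` sysctl named above is readable without root and uses the ~101.6 GB figure from this section; the helper names are hypothetical, and the MB-to-GB conversion follows the "122000 = 122 GB" convention used in the command's comment.

```python
import subprocess

MODEL_GB = 101.6  # footprint quoted in the requirement above


def headroom_gb(wired_limit_mb, model_gb=MODEL_GB):
    """GB left for KV cache and activations once the weights are resident."""
    return wired_limit_mb / 1000 - model_gb


def read_wired_limit_mb():
    """Read the current limit on macOS (assumption: the key exists and prints an integer)."""
    result = subprocess.run(
        ["sysctl", "-n", "iogpu.wired_limit_mb"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# With the recommended setting there are roughly 20 GB of headroom.
recommended = round(headroom_gb(122000), 1)
```

On a Mac, `headroom_gb(read_wired_limit_mb())` gives the live figure; a small or negative result means the limit has not been raised.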
Use it
python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
# query it (`enable_thinking` toggles the reasoning trace; GLM-specific; off = faster, on = harder problems):
curl -s localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Write a typed debounce in TypeScript."}],"chat_template_kwargs":{"enable_thinking":true}}'
# drive the 51-tool agent on your repo:
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
In LM Studio: run the patch, fully quit + reopen, then load the model.
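The same query as the curl call above can be issued from Python with only the standard library. This is a sketch, assuming the server is listening on `localhost:8080` as in the example and returns the usual OpenAI-style `choices[0].message.content` shape; `build_chat_request` and `ask` are illustrative names, not part of the repo.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=True, base_url="http://localhost:8080/v1"):
    """Build the POST request; enable_thinking is passed through chat_template_kwargs."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, enable_thinking=True):
    """Send the request to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, enable_thinking)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Write a typed debounce in TypeScript.", enable_thinking=False)
```

With the server running, `ask("Write a typed debounce in TypeScript.")` returns the reply; passing `enable_thinking=False` trades reasoning depth for speed, as noted in the comment above.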
Design: elite, not just competent (full guide + copy-paste system prompt: design/DESIGN.md, with 9 movement-grounded gold seeds). The base prior reverts to the average of its training (hex + arbitrary
spacing), so steer + gate it. Prepend src/design_canon.py's CANON (oklch-only · 8px grid · 1.25 type scale ·
WCAG · bespoke, no Bootstrap/Tailwind/framework cookie-cutter) as the system prompt for elite output
today; audit_design() gates eliteness (OKLCH/grid/scale + rejects framework boilerplate) and the
constrained decoder bans non-OKLCH tokens; scripts/76_design_flywheel.py (generate→audit→keep-only-elite→SFT)
heals the native prior so it designs elite with no prompt at all.
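To make the gate concrete, here is a minimal sketch of two of the canon's rules plus the WCAG contrast check. It is not the repo's `audit_design()`, whose signature is not shown here: `toy_audit`, `relative_luminance`, and `contrast_ratio` are hypothetical helpers, the CSS parsing is deliberately naive regex matching, and the contrast function assumes colors have already been converted to 8-bit sRGB. The luminance and ratio formulas are the ones defined by WCAG 2.x.

```python
import re

HEX_COLOR = re.compile(r"#[0-9a-fA-F]{3,8}\b")
SPACING_DECL = re.compile(r"(?:margin|padding|gap)[\w-]*\s*:\s*([^;}]+)")
PX_VALUE = re.compile(r"(\d+(?:\.\d+)?)px")


def toy_audit(css, grid=8):
    """Flag hex colors (canon is OKLCH-only) and spacing values that are off the px grid."""
    problems = [f"non-OKLCH color {m.group(0)}" for m in HEX_COLOR.finditer(css)]
    for decl in SPACING_DECL.finditer(css):
        for px in PX_VALUE.finditer(decl.group(1)):
            if float(px.group(1)) % grid != 0:
                problems.append(f"off-grid length {px.group(0)}")
    return problems


def relative_luminance(rgb):
    """WCAG relative luminance of an 8-bit sRGB triple."""
    def linear(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


css = ".card { color: #ff0066; background: oklch(70% 0.1 200); padding: 12px 16px; margin: 24px; }"
issues = toy_audit(css)
black_on_white = contrast_ratio((0, 0, 0), (255, 255, 255))
```

A gate built this way would reject the sample card for its hex color and 12px padding; WCAG AA additionally requires a contrast ratio of at least 4.5:1 for normal-size text.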
Performance (M5 Max 128 GB, v4)
| Metric | Value |
|---|---|
| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
| HumanEval pass@1 | 19/20 (95%), single-shot |
| Math GSM8K | 8/12 (67%), recovered from v3's 0/5 (code-first balanced calibration kept the math super-experts alive through the prune) |
| Algebra (SymPy-checked) | 3/4 (75%) |
| Decode speed | 11.3 tok/s (no draft); see the speed note in limitations |
| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
Benchmark honesty: every number is contamination-checked. HumanEval, GSM8K, and miniF2F test problems show
0 % / 0 % / 0.4 % near-duplicate overlap with the training data, so they're reasoned, not memorized. Method + full
training-data provenance/licenses: TRAINING_DATA.md.
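One common way to compute a near-duplicate rate like the ones quoted above is word n-gram overlap. The sketch below is an assumption about the general approach, not the method documented in TRAINING_DATA.md: the shingle size, the Jaccard threshold, and all function names are illustrative.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets (0.0 if either is too short)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def near_dup_rate(test_items, train_items, threshold=0.5, n=3):
    """Fraction of test items with at least one training item at or above the threshold."""
    flagged = sum(
        1 for item in test_items
        if any(jaccard(item, train, n) >= threshold for train in train_items)
    )
    return flagged / len(test_items)


test_set = ["return the sum of two integers a and b", "reverse a linked list in place"]
train_set = ["Return the sum of two integers a and b", "parse a json file from disk"]
rate = near_dup_rate(test_set, train_set)
```

Here one of the two test items is a case-insensitive copy of a training item, so the rate is 0.5. At real corpus sizes the pairwise loop would be replaced by an index such as MinHash, which this sketch does not attempt.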
Which version for your runtime (June 2026: MLX is now everywhere on Apple Silicon)
| Runtime | MLX (this repo) | GGUF (with the family) |
|---|---|---|
| mlx_lm (CLI / server) | Yes, native | No |
| LM Studio | Yes, Mac (dual-backend) | Yes, Win/Linux |
| Ollama 0.19+ | Yes, Mac (MLX engine, since Mar 2026) | Yes, 0.30 (llama.cpp) |
| macMLX | Yes, native (SwiftUI + OpenAI API) | No |
| llama.cpp | No | Yes |
| mlx-swift apps | Pending, when glm_moe_dsa lands in mlx-swift-lm | No |
MLX is the native Apple-Silicon path: mlx_lm · LM Studio (Mac) · Ollama 0.19+ · macMLX all run it
(MLX beats llama.cpp ~30-40% on M5). GGUF (shipped with the family) covers llama.cpp + Windows/Linux.
Every MLX runtime gets this model the moment glm_moe_dsa lands upstream
(mlx-lm PR #1410), or today via install_glm_dsa_patch.py,
which scans every mlx_lm install (LM Studio's, Ollama's, your venv's).
Roadmap: the Demolition family (shrink, keep the soul)
Same masters-trained soul (design · dataviz · code · security · math · prose · architecture · research), every Mac: the elite training lives in the facet-inclusive calibration + heal corpus, which are size-agnostic:
99GB : ████████ (baseline, this model)
64GB : should hold ~baseline (96 GB Macs)
48GB : should hold high (64 GB Macs)
28GB : the squeeze, watch which facets dip (36-48 GB Macs)
14GB : where does the soul start to break? (24 GB Macs)
7GB : the floor (16 GB laptops)
Each size: facet-calib → prune harder → quantize → heal (the soul corpus) → soul-retention scorecard (% elite
per facet). See design/DESIGN.md.
Honest limitations
- Specialist: ~70% of experts pruned, so it is strong in the target niche and weaker outside it. Not the full 743B.
- Speed: ~11 tok/s decode (reading pace; 3 min for long thinking-ON answers). Partly MLX's still-naive DSA attention kernels (mlx #837 / #3402; improves for free as MLX matures), partly the bandwidth cost of a 743B-class MoE on a laptop. Measured dead-ends (don't bother): 4-bit re-quant is slower for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1. Real path: `--dsa-block-size` sweep (free) → upstream MLX → MTP self-speculative (2.6×, a port for this arch). Not a quant change.
- Multilingual ability reduced (optional vocab-trim drops ~31% of tokens).
- Design is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when tested); the design-canon heal closes this.
- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is Metal-unstable on this MoE; MTP self-speculative is the right path, and the external draft is not recommended.
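The speed figures above translate into wait times with simple arithmetic. The sketch below uses the 11.3 tok/s decode rate from the performance table and the 2.6× MTP figure from the speed bullet; treating 2.6× as a uniform multiplier on decode throughput is an assumption, and the function names are illustrative.

```python
BASELINE_TOK_PER_S = 11.3  # measured decode speed, no draft
MTP_SPEEDUP = 2.6          # figure quoted for MTP self-speculative decoding


def tokens_generated(seconds, tokens_per_second=BASELINE_TOK_PER_S):
    """Tokens decoded in the given wall-clock time."""
    return seconds * tokens_per_second


def decode_seconds(tokens, tokens_per_second=BASELINE_TOK_PER_S):
    """Wall-clock seconds needed to decode the given number of tokens."""
    return tokens / tokens_per_second


# A 3-minute thinking-ON answer corresponds to about 2,034 tokens at the baseline rate.
long_answer_tokens = round(tokens_generated(180))

# The same answer under the assumed uniform 2.6x speedup.
with_mtp_seconds = decode_seconds(long_answer_tokens, BASELINE_TOK_PER_S * MTP_SPEEDUP)
```

Under that assumption the 3-minute answer drops to a little over a minute, which is why the bullet points at MTP rather than at further quantization.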
Attribution & license
MIT. Base model © Z.ai (zai-org/GLM-5.2, MIT-licensed), so this derivative is MIT too: free
to use, modify, and redistribute with attribution to Z.ai. The demolition / healing / 51-tool agent
tooling is this repo's contribution.
