Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

OpenClaw

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with OpenClaw:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# In a second terminal, call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'


	---
	license: mit
	base_model: zai-org/GLM-5.2
	library_name: mlx
	pipeline_tag: text-generation
	language: [en]
	tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
	---

	# GLM-5.2-Demolition — a 743B frontier MoE, demolished to run on a 128 GB Mac

	One line: we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
	demolished it to 99 GB so it runs fully on-device on a MacBook Pro M5 Max (128 GB) — then
	healed it and wrapped it in a 47-tool local agent that does things a cloud model structurally
	cannot: the compiler steers every line it writes, it **can't fake a passing test or leak a
	secret, and it can be fine-tuned on your private repo** so it writes in your style.

	A niche specialist, not a general model — tuned to beat a frontier model in one lane (agentic
	coding + design for TS/JS/Python/Rust/Go/HTML/CSS + Postgres) by out-verifying it, not out-knowing it.

	## My AI-Engineer / Full-Stack / Data-Science / ML build
	This is the version I run wearing all four hats — one on-device model, no cloud key, tooled for the
	whole stack of those roles (strongest in the coding/agentic lane, deliberately so):
	- AI Engineer — builds and ships agentic AI locally: the 47-tool ReAct agent, **verified +
	constrained decoding**, grammar-constrained tool I/O, MLX-native serving + the speed/stability work
	(prompt-cache, continuous batching, frontier-grade serving). The model that makes AI products.
	- Full-Stack — front-to-back in TS/JS/Python/Rust/Go/HTML/CSS + Postgres, the **compiler steering
	every line, a design soul** (render-and-see critic: WCAG / type-scale / OKLCH) for the UI, and
	SQL-on-a-real-schema for the API — plus edit→test→fix agentic loops on your repo.
	- Data Science — stateful REPL, SymPy / pandas / numpy / sklearn, arXiv-RAG, competition-grade
	math (GSM8K-style), and code-rendered figures (matplotlib / manim / TikZ).
	- Machine Learning — it is applied ML end-to-end: REAP expert-pruning (256→77), **mixed-precision
	quantization, LoRA healing, distillation, MTP self-speculation**, GRPO/RLVR experiments — the
	build itself is a working reference.

	…and the hats that fall straight out of "verify-everything":
	- Security / DevSecOps — secret-scanning (16 providers: AWS/GitHub/OpenAI/Anthropic/HuggingFace/Slack/Stripe/Google/DB-URLs/JWT/PEM…),
	prompt-injection guard, test-tamper + fabrication-proof `done`, slopsquat/typosquat guard, risk-gated
	tools. It structurally can't leak a key or fake a green test.
	- Formal-Methods / Verification Engineer — a local Lean-4 prover (premise selection, expert-iteration,
	self-correction from the real Lean error) → correct-by-construction math, not vibes.
	- MLOps / Inference — the serving spine: prompt-cache, continuous batching, watchdog + circuit-breaker +
	memory-ceiling — frontier-grade stability for hours-long local runs on one box.
	- Multimodal / CV — reads images + video (VLM), palette-steered image-gen, code-rendered
	video/figures (manim/TikZ) — all MLX.
	- Design Engineer — a render-and-*see* critic enforcing WCAG contrast, modular type scale,
	8 px grid, OKLCH harmony (not just "looks fine").

	One model, fully local, verify-everything — every hat above, on a MacBook.
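The secret-scanning guard listed above is not shown in this README. As a rough illustration of the idea only, a scanner can be sketched as a table of regular expressions run over any text before it leaves the agent. The three patterns below (AWS access key IDs, GitHub classic personal access tokens, PEM private-key headers) follow publicly documented formats. The names `SECRET_PATTERNS`, `scan_secrets`, and `redact`, and the choice of patterns, are assumptions for this sketch and not the repo's 16-provider implementation.

```python
import re

# Hypothetical pattern table: three well-known formats only,
# not the repo's actual 16-provider list.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_secrets(text):
    """Return (provider, matched_text) pairs for every pattern hit in text."""
    hits = []
    for provider, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((provider, match.group(0)))
    return hits


def redact(text):
    """Replace every hit with a placeholder so the text is safe to log or send."""
    for provider, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{provider}]", text)
    return text
```

A guard built this way would typically run `scan_secrets` on tool output and refuse or `redact` before anything is written to a log, a commit, or a prompt.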

	## How it was made
	1. Pruned the MoE experts 256 → 77 by router-weighted saliency (REAP = `router_weight ×
	activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set, because the full model never fits in RAM).
	2. Quantized mixed-precision (MLX): experts 3-bit, attention/embeddings/lm_head 4-bit → 99 GB.
	3. Healed with LoRA SFT (`--no-mask-prompt`, grad-checkpointed). The current v4 rebuild uses a
	code-first balanced calibration, so the math super-experts survive the prune (v3's coding-only
	calibration collapsed math), plus heal/distill on R1 long-CoT reasoning traces. Router-KD and expert-wise
	Logit-KD are optional, research-validated recovery stages. GRPO/RLVR was tried and regressed, so the heal stays on SFT.
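The pruning score in step 1 can be sketched in a few lines. The README gives only the product `router_weight × activation_norm` and the padding mask. Averaging that product over the tokens routed to each expert, the top-k selection, and the input layout used here are assumptions made for illustration, as are the names `reap_saliency` and `keep_top_experts`.

```python
from collections import defaultdict


def reap_saliency(routed, pad_mask):
    """Score experts for one layer.

    routed:   one entry per token, each a list of
              (expert_id, router_weight, activation_norm) tuples
              (hypothetical layout for this sketch).
    pad_mask: one bool per token, False for padding tokens.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for token_routes, is_real in zip(routed, pad_mask):
        if not is_real:
            continue  # padding-masked: padding tokens never contribute
        for expert_id, router_weight, activation_norm in token_routes:
            total[expert_id] += router_weight * activation_norm
            count[expert_id] += 1
    # Assumption: mean over the tokens actually routed to each expert.
    return {e: total[e] / count[e] for e in total}


def keep_top_experts(saliency, num_experts, keep):
    """Return the sorted ids of the `keep` highest-scoring experts.

    Experts that never received a real token score 0.0.
    """
    scores = [saliency.get(e, 0.0) for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:keep])
```

With the numbers from step 1, `keep_top_experts(saliency, num_experts=256, keep=77)` would be called once per layer, which is what allows the streaming layer-by-layer pass.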

	## What makes it different (built + selftested)
	- Verified decoding (compiler-steered): generates line-by-line while the **real type-checker runs in
	the loop**; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check.
	Practical only on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
	- The verifier mesh: every output meets its real tool — compile+run+idiomatic lint (clippy/ruff/
	gofmt/prettier) for 5 langs, SQL (sqlite), math (SymPy), proofs (Lean 4), design (render+see).
	- A 47-tool agent with five defense layers the frontier lacks out of the box:
	trust (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
	reliability (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
	self-improvement (skill library, large-output pointers, clarify-before-assuming),
	integrity (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
	plus a humanizer (kills AI-slop, matches your voice).
	- Own your repo: `scripts/64_own_your_repo.py` fine-tunes the model on your private codebase so it
	writes in your style — a cloud flagship can't be tuned on your private code.
	- Design soul (render-and-measure critic: WCAG/type-scale/OKLCH), CallSieve zero-token retrieval +
	live-docs RAG, vision/voice/video (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
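The verified-decoding loop described in the first bullet can be sketched as follows. This is a toy under stated assumptions, not the repo's decoder: `count_errors` is a stand-in for a real type-checker (it only catches syntax errors and reads of unassigned names in straight-line Python), and `steps` is a hypothetical list of ranked candidate lines standing in for the model's sampling.

```python
import ast
import builtins


def count_errors(source):
    """Toy checker: 1 for a syntax error, otherwise the number of reads of
    names that were never assigned (straight-line code only)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 1
    defined = set(dir(builtins))
    errors = 0
    for stmt in tree.body:
        # Check reads first, then record what this statement assigns.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id not in defined:
                    errors += 1
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                defined.add(node.id)
    return errors


def verified_decode(steps, check=count_errors):
    """Accept one line per step: the first candidate that adds no new error.

    A candidate that raises the error count is backtracked and the next
    candidate is tried. If every candidate fails, the step is skipped.
    """
    accepted = []
    baseline = 0  # error count of the accepted prefix
    for candidates in steps:
        for line in candidates:
            errors = check("\n".join(accepted + [line]))
            if errors <= baseline:
                accepted.append(line)
                baseline = errors
                break
    return accepted
```

For example, given `[["total = 0"], ["total = totl + 1", "total = total + 1"], ["print(total", "print(total)"]]`, the loop rejects the misspelled name and the unclosed call and keeps the second candidate in both steps.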

	## Requirements
	- Apple Silicon, 128 GB unified memory (M5-class recommended), macOS 26/27+. MLX ≥ 0.31.
	- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the bundled patch (`glm_moe_dsa.py`
	+ `install_glm_dsa_patch.py`) — current stock mlx_lm can't load it. Native support is landing upstream
	([ml-explore/mlx-lm PR #1410](https://github.com/ml-explore/mlx-lm/pull/1410)); once it merges, recent mlx_lm
	loads this model with no patch — the bundled patch is the interim loader for older versions.
	- ⚠️ Raise the GPU memory ceiling (required). The model needs ~101.6 GB and macOS caps the GPU
	working set at ~110 GB by default, which leaves too little headroom for long generations: the
	run OOM-crashes (Metal command-buffer timeout). Fix before serving:
	```bash
	sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
	sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
	```
	Without this the model appears to "randomly crash" — it's just memory-starved.
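To make the headroom concrete, the arithmetic can be sketched in Python. The ~101.6 GB model size, the ~110 GB default cap, and the `iogpu.wired_limit_mb` key come from the text above; the helper names are hypothetical, and `read_wired_limit_mb` only works on macOS, where it shells out to `sysctl`.

```python
import subprocess

MODEL_GB = 101.6  # model footprint quoted in this README


def headroom_gb(wired_limit_mb, model_gb=MODEL_GB):
    """GB left under the GPU wired limit after the model is resident.

    Uses 1 GB = 1000 MB, matching the README's "122000 # 122 GB".
    """
    return wired_limit_mb / 1000 - model_gb


def read_wired_limit_mb():
    """Read the current limit on macOS (hypothetical helper, untested elsewhere)."""
    out = subprocess.run(
        ["sysctl", "-n", "iogpu.wired_limit_mb"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())
```

Under these figures, `headroom_gb(110000)` is about 8.4 GB and `headroom_gb(122000)` is about 20.4 GB, which is the extra room the `sysctl` line buys for long generations.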

	## Use it
	```bash
	python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
	GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
	--adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
	# drive the 47-tool agent on your repo:
	python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
	# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
	```
	In LM Studio: run the patch, fully quit + reopen, then load the model.
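Once the server from the block above is running, it can be called from Python with the standard library alone. This is a minimal sketch: it assumes the server is on `mlx_lm.server`'s default port 8080, that the model id matches the `--model` path used above, and that the reply follows the OpenAI chat-completions shape. The function names and the `max_tokens` default are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(prompt, model="models/GLM-5.2-q3a4-v4", max_tokens=256):
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url="http://localhost:8080/v1"):
    """POST the prompt to the local server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("Write a binary search in Rust")` blocks until generation finishes, so at ~11 tok/s a long answer can take minutes.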

	## Performance (M5 Max 128 GB, v4)
	| Metric | Value |
	|---|---|
	| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
	| HumanEval pass@1 | 19/20 (95%), single-shot |
	| Math GSM8K | 8/12 (67%), recovered from v3's 0/5 (code-first balanced calibration kept the math super-experts alive through the prune) |
	| Algebra (SymPy-checked) | 3/4 (75%) |
	| Decode speed | 11.3 tok/s (no draft); see the speed note in limitations |
	| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |

	## Honest limitations
	- Specialist: ~70% of experts pruned — strong in the target niche, weaker outside it. Not the full 743B.
	- Speed ~11 tok/s decode (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
	DSA attention kernels (mlx #837 / #3402; this improves for free as MLX matures), partly the bandwidth
	cost of a 743B-class MoE on a laptop. Measured dead-ends (don't bother): a 4-bit re-quant is slower
	for single-token decode (decode is bandwidth-bound, so the smaller 3-bit experts win); cutting active experts 8→4 gives no win at batch=1.
	Real path: `--dsa-block-size` sweep (free) → upstream MLX → MTP self-speculative (~2.6×, a port
	for this arch). Not a quant change.
	- Multilingual ability reduced (optional vocab-trim drops ~31% of tokens).
	- Design is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
	tested) — the design-canon heal closes this.
	- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is Metal-unstable
	on this MoE — MTP self-speculative is the right path; the external draft is not recommended.
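The speed figures above translate into wall-clock time with simple arithmetic. The 11.3 tok/s rate and the ~2.6× MTP estimate come from this README; the 2034-token answer length is an assumed example, chosen because it corresponds to the "~3 min" figure at that rate.

```python
def decode_seconds(tokens, tok_per_s=11.3, speedup=1.0):
    """Wall-clock decode time, ignoring prompt processing."""
    return tokens / (tok_per_s * speedup)


long_answer = 2034  # assumed token count for a long thinking-ON answer
baseline = decode_seconds(long_answer)               # about 180 s, i.e. ~3 min
with_mtp = decode_seconds(long_answer, speedup=2.6)  # about 69 s if the ~2.6x estimate holds
```

The second figure is a projection from the quoted speedup, not a measurement.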

	## Attribution & license
	MIT. Base model © Z.ai (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
	to use, modify, and redistribute with attribution to Z.ai. The demolition / healing / 47-tool agent
	tooling is this repo's contribution.
