Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
```python
# Make sure mlx-lm is installed:
#   pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Speed: the MEASURED reality (M5 Max 128 GB)
Root cause (confirmed): decode is memory-bandwidth bound. Every token reloads expert weights from the ~106 GB model. The honest conclusion that follows was measured, not guessed (2026-06-18):
Every single-stream speedup lever has been benchmarked, and every one is dead on this MoE. ~11–14 tok/s is the hard memory floor. You cannot speculate your way past it.
| Lever | Hoped | MEASURED | Why it fails |
|---|---|---|---|
| MTP self-speculative | ~2.6× | 0% accept (`89_mtp_gate.py`) | the native head is UN-pruned (256 experts) vs our pruned 77 → its MoE/router don't match → garbage drafts |
| External draft model | 1.5–2.5× | 0.32× (proxy) | the batched verify forward reloads ~all 77 experts; also draft+main ≈126/128 GB + Metal-unstable |
| Prompt-lookup (no model) | 2–4× | 0.32× (`src/prompt_lookup.py`) | lossless, but the same MoE verify-wall: 8.4 tok/forward, but each forward loads ~all experts |
| dsa-block-size / index_topk | ~1.3× | flat 1.00–1.03× (`90_dsa_sweep.py`) | attention is NOT the bottleneck; the MoE expert-load dominates |
| Reduce active experts 8→6 | ~25% | quality-dead (`22_speed_tune.py`) | the router was trained for top-8 of the pruned 77; can't drop post-prune without a full re-prune+heal |
The physics
A MoE decode step is bound by loading the active experts' weights. Any multi-token speculative forward (MTP, draft-model, or n-gram) loads the union of experts for those tokens (~all 77), so it costs ~K× a single step, and the per-forward token gain is cancelled (or worse, as the measured 0.32× shows). So no speculative method beats single-token decode here. Proven 4 independent ways, $0 spent (vs the $4–15K an EAGLE retrain would have cost to discover the same thing).
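The verify-wall argument can be sketched as a toy cost model. This is only an illustration under loud assumptions that are not from the measurements above: step time is taken as proportional to the number of distinct experts loaded, and each token is assumed to pick its top-8 of 77 experts uniformly and independently (real routers are correlated). The function names are hypothetical, and the model is not fitted to the 0.32× results.

```python
def expected_experts_loaded(k_tokens: int, n_experts: int = 77, top_k: int = 8) -> float:
    """Expected number of distinct experts touched by k_tokens in one MoE layer.

    ASSUMPTION: every token selects top_k distinct experts uniformly at random,
    independently of the other tokens.
    """
    # Probability that a given expert is missed by all k_tokens tokens.
    p_missed_by_all = (1.0 - top_k / n_experts) ** k_tokens
    return n_experts * (1.0 - p_missed_by_all)


def relative_forward_cost(k_tokens: int, n_experts: int = 77, top_k: int = 8) -> float:
    """Cost of one k_tokens-wide verify forward, in units of a single-token step.

    ASSUMPTION: step time is proportional to expert weights loaded; attention
    and any shared/dense weights are ignored.
    """
    return expected_experts_loaded(k_tokens, n_experts, top_k) / top_k


def speculative_speedup(accepted_per_forward: float, k_tokens: int) -> float:
    """Tokens gained per forward divided by what that forward costs."""
    return accepted_per_forward / relative_forward_cost(k_tokens)


def pays_off(accepted_per_forward: float, k_tokens: int) -> bool:
    """Speculation only wins if accepted tokens outnumber the extra expert loads."""
    return speculative_speedup(accepted_per_forward, k_tokens) > 1.0
```

Under this model a forward over K tokens has to yield more accepted tokens than its relative cost just to break even, which is why a 0% accept rate, or a verify pass that touches nearly every expert, cannot come out ahead.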
What ACTUALLY delivers speed (and is already shipped)
- Fused MoE dequant-matmul Metal kernel (#33): this is the 11–14 tok/s; it's the real win.
- Throughput via batching: MEASURED 2.6× at B=8 (`scripts/91_batch_scaling.py`): total decode 15.8 → 27.1 → 34.6 → 41.1 tok/s at B = 1 / 2 / 4 / 8. On a memory-bound MoE, parallel sequences are the lever: one expert-load serves the whole batch (per-seq drops, total climbs). This is where "faster" actually lives (best-of-N, proof-search, multi-agent, the flywheel), and `mlx_lm.server` batches concurrent requests natively (`is_batchable`, no draft model, which is our case), so the win is available at the serve layer (#71). Serve-verified: 1.74× at B=6 on the live `mlx_lm.server` (concurrent vs sequential, `scripts/92_serve_batch_test.py`). Already shipping; just fire concurrent requests. The ONLY measured speedup on this model that beats 1×.
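"Just fire concurrent requests" might look like the following minimal client sketch. It assumes the server from the MLX LM section is running on `http://localhost:8080/v1`; the helper names (`build_payload`, `post_chat`, `run_concurrent`) and the `max_tokens` value are illustrative choices, not part of this repo's scripts.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# ASSUMPTION: mlx_lm.server is running locally on its default port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build one OpenAI-style chat completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict) -> str:
    """POST one request and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def run_concurrent(prompts, send=post_chat, max_workers=8):
    """Send all prompts at once so the server can batch them.

    Results come back in the same order as the prompts.
    """
    payloads = [build_payload(p) for p in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))


# Usage (needs the live server):
# answers = run_concurrent(["Prove lemma A", "Prove lemma B", "Prove lemma C"])
```

Threads are enough here because the client only waits on the network; the `send` parameter exists so the fan-out logic can be exercised without a running server.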
Free, orthogonal wins (prefill / TTFT, not decode tok/s)
- Prompt/prefix KV cache (`05_serve.sh --prompt-cache-size`): agentic loops resend the same system+tools; cache the prefix → big time-to-first-token cut. Real, but it doesn't move the decode tok/s number.
- M5 Neural Accelerators / current MLX: free generation + prefill gains from keeping `mlx` current.
- `08_think_proxy.py`: skip the thinking trace on trivial structural steps → fewer tokens, not faster/token.
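Why the prefix cache cuts time-to-first-token but not decode speed can be shown with a toy sketch. It tracks token ids only; a real cache stores the attention keys and values for those tokens. The `PrefixCache` class is hypothetical and is not how `05_serve.sh` or `mlx_lm` implement it.

```python
def common_prefix_len(a, b) -> int:
    """Number of leading token ids the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy single-entry prefix cache keyed on the previous prompt's token ids."""

    def __init__(self):
        self.cached_tokens = []

    def tokens_to_prefill(self, prompt_tokens):
        """Return only the suffix that still needs a prefill pass."""
        keep = common_prefix_len(self.cached_tokens, prompt_tokens)
        # Remember this prompt so the next turn can reuse its prefix.
        self.cached_tokens = list(prompt_tokens)
        return list(prompt_tokens[keep:])


# Illustrative token ids: a shared system+tools prefix, then a per-turn suffix.
system_and_tools = [101, 102, 103, 104]
cache = PrefixCache()
first_turn = cache.tokens_to_prefill(system_and_tools + [7, 8])
second_turn = cache.tokens_to_prefill(system_and_tools + [9])
```

Only the prefill work shrinks on the second turn; every generated token still pays the full expert-load cost, which is why this shows up in TTFT and not in tok/s.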
The ONLY path to a single-stream speedup (not recommended)
Train a fresh EAGLE-3 head on the demolished model's own outputs (architecture-agnostic, ~1B dense layer): ~$4–15K cloud H100 or ~weeks of local M5 data-gen. And even then it's uncertain: EAGLE on a fast MoE baseline measured ~1.03× in the literature (same verify-wall). See #69. Bank the batching instead.
Bottom line
Single-stream is memory-capped at ~11–14 tok/s and that's the floor. The speed pillar is the fused kernel (#33) + throughput batching (#35/#48), both built. Don't fight physics; batch for throughput.
(Receipts: `89_mtp_gate.py` → 0%, `src/prompt_lookup.py --bench` → 0.32×, `90_dsa_sweep.py` → flat. Full log in `OVERNIGHT_LOG.md` under the 2026-06-18 speed entries.)