Text Generation
MLX
Safetensors
English
glm_moe_dsa
apple-silicon
Mixture of Experts
pruned
quantized
soul-targeted
agentic
local-agent
glm
conversational
Eval Results (legacy)
4-bit precision
Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
```
Run Hermes
hermes
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm
# Start the server (mlx_lm.server listens on port 8080 unless --port is given)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
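The same request can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server is listening on port 8080 (the address the Pi and Hermes configurations above point at) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_payload` and `chat`, and the `max_tokens` and `timeout` defaults, are illustrative choices and not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Assumption: the local server is reachable on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_payload(user_text, model=MODEL_ID, max_tokens=256):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text, base_url=BASE_URL, timeout=600):
    """POST one user message to the local server and return the reply text.

    The generous timeout is an assumption: a model of this size decodes slowly.
    """
    body = json.dumps(build_chat_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumes the OpenAI-style layout: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply; if the server was started with a different `--port`, pass a matching `base_url`.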
Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,116 +1,42 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model: zai-org/GLM-5.2
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
language: [en]
|
| 7 |
-
tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
|
| 8 |
---
|
| 9 |
|
| 10 |
-
# GLM-5.2-Demolition
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
|
| 46 |
-
|
| 47 |
-
## Every chip on the M5 Max, working
|
| 48 |
-
The agent spreads perception, verification, and dispatch across **all six compute blocks** so the GPU stays
|
| 49 |
-
free for token generation (built + selftested):
|
| 50 |
-
- **GPU** (40-core + M5 Neural Accelerators): the 99 GB model decodes + LoRA-heals.
|
| 51 |
-
- **Neural Engine** (16-core): embeddings · OCR · image segmentation / pose / object-detection · NER+POS ·
|
| 52 |
-
audio classification + VAD · neural TTS · zero-shot routing · rerank, all via Apple frameworks, no CoreML, no GPU.
|
| 53 |
-
- **18 CPU cores**: the verifier mesh fanned out (`verify_many`, 6.6×) · 9-language compile-verify · tabular ML.
|
| 54 |
-
- **Media Engine**: hardware H.264/HEVC/AV1 decode + encode for the video lane.
|
| 55 |
-
- **AMX/SME**: matrix coprocessor via Accelerate (~2.1 TFLOP/s f32), implicit in every numpy op.
|
| 56 |
-
- **ASR** = **Whisper on MLX** (no mic-permission needed). An **Any-to-Any omni-router** sends any input
|
| 57 |
-
(text / image / audio / video / table) to its optimal block.
|
| 58 |
-
|
| 59 |
-
## The model factory (swappable domain souls)
|
| 60 |
-
One 99 GB base + hot-swappable LoRA "souls" (~100 MB each): change the model's specialty by swapping the
|
| 61 |
-
adapter: **code · design · agentic · gamedev · legacy/enterprise · security · fullstack · science · data ·
|
| 62 |
-
perfumery**. Each is healed from the same base by an autonomous chain that forges the whole library overnight
|
| 63 |
-
on the one Mac, and a `factory`-dispatcher soul makes the model route requests to the right specialty itself.
|
| 64 |
-
|
| 65 |
-
## Requirements
|
| 66 |
-
- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
|
| 67 |
-
- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patches**; stock
|
| 68 |
-
mlx_lm can't load it.
|
| 69 |
-
- **⚠️ Raise the GPU memory ceiling (required).** The model needs ~101.6 GB; macOS caps the GPU
|
| 70 |
-
working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long
|
| 71 |
-
generations. Fix before serving:
|
| 72 |
-
```bash
|
| 73 |
-
sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
|
| 74 |
-
sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
|
| 75 |
-
```
|
| 76 |
-
Without this the model appears to "randomly crash"; it's just memory-starved.
|
| 77 |
-
|
| 78 |
-
## Use it
|
| 79 |
-
```bash
|
| 80 |
-
python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
|
| 81 |
-
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
|
| 82 |
-
--adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
|
| 83 |
-
# drive the 47-tool agent on your repo:
|
| 84 |
-
python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
|
| 85 |
-
# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
|
| 86 |
-
```
|
| 87 |
-
In **LM Studio**: run the patch, fully quit + reopen, then load the model.
|
| 88 |
-
|
| 89 |
-
## Performance (M5 Max 128 GB, v4)
|
| 90 |
-
| Metric | Value |
|
| 91 |
-
|---|---|
|
| 92 |
-
| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
|
| 93 |
-
| HumanEval pass@1 | **19/20 (95%)**, single-shot |
|
| 94 |
-
| Math GSM8K | **8/12 (66%)**, recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
|
| 95 |
-
| Algebra (SymPy-checked) | **3/4 (75%)** |
|
| 96 |
-
| Decode speed | **11.3 tok/s** (no draft); see the speed note in limitations |
|
| 97 |
-
| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
|
| 98 |
-
|
| 99 |
-
## Honest limitations
|
| 100 |
-
- **Specialist:** ~70% of experts pruned: strong in the target niche, weaker outside it. Not the full 743B.
|
| 101 |
-
- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
|
| 102 |
-
**DSA attention kernels** (mlx #837 / #3402; *improves for free* as MLX matures), partly the bandwidth
|
| 103 |
-
cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
|
| 104 |
-
for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1.
|
| 105 |
-
**Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
|
| 106 |
-
for this arch). Not a quant change.
|
| 107 |
-
- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
|
| 108 |
-
- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
|
| 109 |
-
tested); the design-canon heal closes this.
|
| 110 |
-
- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
|
| 111 |
-
on this MoE: **MTP self-speculative is the right path**; the external draft is not recommended.
|
| 112 |
-
|
| 113 |
-
## Attribution & license
|
| 114 |
-
**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed), so this derivative is MIT too: free
|
| 115 |
-
to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 47-tool agent
|
| 116 |
-
tooling is this repo's contribution.
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model: zai-org/GLM-5.2
|
| 4 |
+
tags: [mlx, apple-silicon, moe, pruned, quantized, soul-targeted]
|
| 5 |
+
private: true
|
|
|
|
|
|
|
| 6 |
---
|
| 7 |
|
| 8 |
+
# GLM-5.2-Demolition q4a4-v3 (soul-targeted)
|
| 9 |
+
|
| 10 |
+
A demolition of **GLM-5.2** (743B total / 39B active MoE, MIT) down to a **~98 GB 4-bit** model that
|
| 11 |
+
loads on a single **M5 Max 128 GB**. v3's distinguishing move is **soul-targeted expert pruning**: the
|
| 12 |
+
kept experts are chosen by saliency measured on *our* facet data (code, design, math, security, gamedev,
|
| 13 |
+
agentic, retrieval), not a generic corpus.
|
| 14 |
+
|
| 15 |
+
## The demolition lineage (honest)
|
| 16 |
+
| ver | prune | quant | size | result |
|
| 17 |
+
|-----|-------|-------|------|--------|
|
| 18 |
+
| v1 | keep 30% experts (generic) | 3-bit | 99 GB | broken: hallucinates, sentence-loops |
|
| 19 |
+
| v2 | keep 23% experts (code-calib) | 4-bit | 98 GB | design coherent; trivia gone (expected) |
|
| 20 |
+
| **v3** | keep 23% experts (**soul-calib**) | 4-bit | ~98 GB | _measured: TBD; see Eval_ |
|
| 21 |
+
|
| 22 |
+
## Method
|
| 23 |
+
1. **Saliency** (`23_stream_calibrate`) on `soul_calib_v3.jsonl` (1465 facet sequences): streams the
|
| 24 |
+
381 GB mxfp4 original layer-by-layer, scores each routed expert by activation on our souls.
|
| 25 |
+
2. **Prune** (`24_apply_prune --ratio 0.77`): keep the top-saliency experts per MoE layer.
|
| 26 |
+
3. **Re-quantize** (`24b_stream_requantize --bits 4`): uniform 4-bit experts, 4-bit attn, 6-bit head.
|
| 27 |
+
4. **Heal** (`06_heal_lora`): LoRA on the soul heal set (gold + design + flywheel verified-fixes).
|
| 28 |
+
|
| 29 |
+
## Honest scope
|
| 30 |
+
- **Speed:** ~10 tok/s (M5 memory-bandwidth bound, inherent to a 98 GB model).
|
| 31 |
+
- **Strengths:** our facets (code, design, security, math). **Not** general trivia β those experts were
|
| 32 |
+
deliberately pruned. Perfumery questions will fail *by design*.
|
| 33 |
+
- **Reality check:** a clean right-sized model (Qwen3-Coder ~30B @4-bit) is faster *and* broader. This
|
| 34 |
+
artifact is the **best-possible demolition** of a 743B giant onto a laptop: a research result, not a
|
| 35 |
+
daily driver. The methods (REAP saliency, soul-targeting, the heal recipe) are the transferable value.
|
| 36 |
+
|
| 37 |
+
## Eval (filled after measurement)
|
| 38 |
+
- design / code / security / math facet probes: _TBD_
|
| 39 |
+
- v3 (soul) vs v2 (code) on our facets: _TBD_
|
| 40 |
+
- tok/s: _TBD_
|
| 41 |
+
|
| 42 |
+
Built with the open pipeline at `glm52-demolition` (scripts 23/24/24b/06). Private until release (MIT).
|
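The prune step in the Method list of the new README (keep the top-saliency experts per MoE layer at `--ratio 0.77`) can be illustrated with a small sketch. This is not the code of `24_apply_prune`; the function names, the per-layer dictionary layout, and the rounding rule are assumptions, and the saliency values stand in for whatever score `23_stream_calibrate` actually produces.

```python
def experts_to_keep(saliency, prune_ratio):
    """Return the sorted indices of the experts that survive pruning.

    saliency: one score per routed expert in a single MoE layer (hypothetical input).
    prune_ratio: fraction of experts to drop, e.g. 0.77 keeps roughly 23%.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    # Assumption: round to the nearest whole expert and always keep at least one.
    n_keep = max(1, round(len(saliency) * (1.0 - prune_ratio)))
    # Rank expert indices from highest to lowest saliency.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])


def prune_plan(layer_saliency, prune_ratio=0.77):
    """Apply the same keep rule independently to every MoE layer."""
    return {
        layer: experts_to_keep(scores, prune_ratio)
        for layer, scores in layer_saliency.items()
    }
```

For example, `experts_to_keep([0.1, 0.9, 0.5, 0.7], 0.5)` keeps experts 1 and 3. Selecting per layer rather than globally means every MoE layer retains the same fraction of its experts, which matches the "per MoE layer" wording of the Method step.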