Instructions for using philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries and local apps.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}
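
If `~/.pi/agent/models.json` already lists other providers, pasting the fragment above over the file would drop them. The following is a minimal sketch of merging the entry instead. It assumes only the schema shown above; the helper name `add_mlx_provider` is hypothetical and is not part of Pi.

```python
import json
from pathlib import Path


def add_mlx_provider(config_path, model_id, base_url="http://localhost:8080/v1"):
    """Merge an "mlx-lm" provider entry into a models.json file.

    Hypothetical helper: it assumes the schema shown in the fragment above
    and leaves any other providers already in the file untouched.
    """
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text()) if path.exists() else {}

    provider = config.setdefault("providers", {}).setdefault("mlx-lm", {})
    provider.update(
        {"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"}
    )

    # Add the model only once, so re-running the helper is harmless.
    models = provider.setdefault("models", [])
    if not any(entry.get("id") == model_id for entry in models):
        models.append({"id": model_id})

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_mlx_provider("~/.pi/agent/models.json", "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")` would produce the same provider block as the hand-written fragment.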

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
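
The same request can be issued from Python with only the standard library. This is a sketch, not an official client. It assumes the server is listening on port 8080, the `mlx_lm.server` default (change `BASE_URL` if you passed `--port`). The function names and the `max_tokens` value are illustrative.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
BASE_URL = "http://localhost:8080/v1"  # assumption: mlx_lm.server default port


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Hello"))
```

The timeout is generous because a model of this size decodes slowly; tune it to your hardware.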

philipjohnbasile committed 11 days ago

Commit 276c029 (verified) · 1 parent: dcd7cd6

Upload README.md with huggingface_hub

Files changed (1)

README.md +37 -111

@@ -1,116 +1,42 @@
 ---
 license: mit
 base_model: zai-org/GLM-5.2
-library_name: mlx
-pipeline_tag: text-generation
-language: [en]
-tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
 ---
-# GLM-5.2-Demolition — a 743B frontier MoE, demolished to run on a 128 GB Mac
-**One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
-demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)** — then
-healed it and wrapped it in a **47-tool local agent** that does things a cloud model structurally
-cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
-secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.
-A **niche specialist**, not a general model — tuned to beat a frontier model *in one lane* (agentic
-coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.
-## How it was made
-1. **Pruned** the MoE experts 256 → 77 by **router-weighted saliency (REAP** = `router_weight ×
-   activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set — it never fits in RAM).
-2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** → **99 GB**.
-3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
-   **code-first balanced calibration** (so the *math* super-experts survive the prune — v3's coding-only
-   calibration collapsed math) + heal/distill on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
-   Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed → SFT.)*
-## What makes it different (built + selftested)
-- **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
-  the loop**; a line that adds an error is backtracked. TS 0.3 ms · Python ~0 ms · Rust 34 ms per check.
-  Practical *only* on Apple Silicon — unified memory lets the model (GPU) and compiler (CPU) share RAM.
-- **The verifier mesh:** every output meets its real tool — compile+run+**idiomatic lint** (clippy/ruff/
-  gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
-- **A 47-tool agent** with **five defense layers** the frontier lacks out of the box:
-  **trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
-  **reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
-  **self-improvement** (skill library, large-output pointers, clarify-before-assuming),
-  **integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
-  plus a **humanizer** (kills AI-slop, matches your voice).
-- **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
-  writes in your style — a cloud flagship can't be tuned on your private code.
-- **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
-  live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
-## Every chip on the M5 Max, working
-The agent spreads perception, verification, and dispatch across **all six compute blocks** so the GPU stays
-free for token generation (built + selftested):
-- **GPU** (40-core + M5 Neural Accelerators) — the 99 GB model decodes + LoRA-heals.
-- **Neural Engine** (16-core) — embeddings · OCR · image segmentation / pose / object-detection · NER+POS ·
-  audio classification + VAD · neural TTS · zero-shot routing · rerank — all via Apple frameworks, no CoreML, no GPU.
-- **18 CPU cores** — the verifier mesh fanned out (`verify_many`, 6.6×) · 9-language compile-verify · tabular ML.
-- **Media Engine** — hardware H.264/HEVC/AV1 decode + encode for the video lane.
-- **AMX/SME** — matrix coprocessor via Accelerate (~2.1 TFLOP/s f32), implicit in every numpy op.
-- **ASR** = **Whisper on MLX** (no mic-permission needed). An **Any-to-Any omni-router** sends any input
-  (text / image / audio / video / table) to its optimal block.
-## The model factory (swappable domain souls)
-One 99 GB base + hot-swappable LoRA "souls" (~100 MB each) — change the model's specialty by swapping the
-adapter: **code · design · agentic · gamedev · legacy/enterprise · security · fullstack · science · data ·
-perfumery**. Each is healed from the same base by an autonomous chain that forges the whole library overnight
-on the one Mac — and a `factory`-dispatcher soul makes the model route requests to the right specialty itself.
-## Requirements
-- **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX ≥ 0.31.**
-- The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patches** — stock
-  mlx_lm can't load it.
-- **⚠️ Raise the GPU memory ceiling — required.** The model needs ~101.6 GB; macOS caps the GPU
-  working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long
-  generations. Fix before serving:
-  ```bash
-  sudo sysctl iogpu.wired_limit_mb=122000        # 122 GB; one-shot (resets on reboot)
-  sudo bash dist/install_gpu_limit.sh            # OR: persist it via a LaunchDaemon
-  ```
-  Without this the model appears to "randomly crash" — it's just memory-starved.
-## Use it
-```bash
-python dist/install_glm_dsa_patch.py          # patch mlx_lm (venv AND LM Studio's bundled engine)
-GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
-    --adapter-path heal/adapters-v4           # serve (OpenAI-compatible); v2 + heal/adapters also ship
-# drive the 47-tool agent on your repo:
-python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
-# speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
-```
-In **LM Studio**: run the patch, fully quit + reopen, then load the model.
-## Performance (M5 Max 128 GB, v4)
-| Metric | Value |
-|---|---|
-| Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
-| HumanEval pass@1 | **19/20 (95%)**, single-shot |
-| Math GSM8K | **8/12 (66%)** — recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
-| Algebra (SymPy-checked) | **3/4 (75%)** |
-| Decode speed | **11.3 tok/s** (no draft) — see the speed note in limitations |
-| Verified-decode checker | TS 0.3 ms · Python ~0 ms · Rust 34 ms |
-## Honest limitations
-- **Specialist:** ~70% of experts pruned — strong in the target niche, weaker outside it. Not the full 743B.
-- **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
-  **DSA attention kernels** (mlx #837 / #3402 — *improves for free* as MLX matures), partly the bandwidth
-  cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
-  for single-token decode (bandwidth-bound, smaller wins); active-experts 8→4 gives no win at batch=1.
-  **Real path:** `--dsa-block-size` sweep (free) → upstream MLX → **MTP self-speculative** (~2.6×, a port
-  for this arch). Not a quant change.
-- **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
-- **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
-  tested) — the design-canon heal closes this.
-- Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
-  on this MoE — **MTP self-speculative is the right path**; the external draft is not recommended.
-## Attribution & license
-**MIT.** Base model © **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) — so this derivative is MIT too: free
-to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 47-tool agent
-tooling is this repo's contribution.

 ---
 license: mit
 base_model: zai-org/GLM-5.2
+tags: [mlx, apple-silicon, moe, pruned, quantized, soul-targeted]
+private: true
 ---
+# GLM-5.2-Demolition q4a4-v3 (soul-targeted)
+A demolition of **GLM-5.2** (743B total / 39B active MoE, MIT) down to a **~98 GB 4-bit** model that
+loads on a single **M5 Max 128 GB**. v3's distinguishing move: **soul-targeted expert pruning** — the
+kept experts are chosen by saliency measured on *our* facet data (code, design, math, security, gamedev,
+agentic, retrieval), not a generic corpus.
+## The demolition lineage (honest)
+| ver | prune | quant | size | result |
+|-----|-------|-------|------|--------|
+| v1  | keep 30% experts (generic) | 3-bit | 99 GB | broken — hallucinates, sentence-loops |
+| v2  | keep 23% experts (code-calib) | 4-bit | 98 GB | design coherent; trivia gone (expected) |
+| **v3** | keep 23% experts (**soul-calib**) | 4-bit | ~98 GB | _measured: TBD — see Eval_ |
+## Method
+1. **Saliency** (`23_stream_calibrate`) on `soul_calib_v3.jsonl` (1465 facet sequences) — streams the
+   381 GB mxfp4 original layer-by-layer, scores each routed expert by activation on our souls.
+2. **Prune** (`24_apply_prune --ratio 0.77`) — keep the top-saliency experts per MoE layer.
+3. **Re-quantize** (`24b_stream_requantize --bits 4`) — uniform 4-bit experts, 4-bit attn, 6-bit head.
+4. **Heal** (`06_heal_lora`) — LoRA on the soul heal set (gold + design + flywheel verified-fixes).
+## Honest scope
+- **Speed:** ~10 tok/s (M5 memory-bandwidth bound — inherent to a 98 GB model).
+- **Strengths:** our facets (code, design, security, math). **Not** general trivia — those experts were
+  deliberately pruned. Perfumery questions will fail *by design*.
+- **Reality check:** a clean right-sized model (Qwen3-Coder ~30B @4-bit) is faster *and* broader. This
+  artifact is the **best-possible demolition** of a 743B giant onto a laptop: a research result, not a
+  daily driver. The methods (REAP saliency, soul-targeting, the heal recipe) are the transferable value.
+## Eval (filled after measurement)
+- design / code / security / math facet probes: _TBD_
+- v3 (soul) vs v2 (code) on our facets: _TBD_
+- tok/s: _TBD_
+Built with the open pipeline at `glm52-demolition` (scripts 23/24/24b/06). Private until release (MIT).
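
The saliency and prune steps named in the README can be illustrated with a small sketch. It follows the one-line definition in the removed README (`router_weight × activation_norm`, padding-masked) and is not the code in `23_stream_calibrate` or `24_apply_prune`, which this page does not show. Two assumptions are made: `--ratio 0.77` is the fraction of experts pruned (consistent with "keep 23%"), and each MoE layer has 256 routed experts (from "256 → 77" in the removed text). All function and argument names are hypothetical.

```python
def reap_saliency(gate_weights, expert_out_norms, padding_mask):
    """Per-expert saliency for one MoE layer.

    gate_weights[t][e]     : router weight for token t, expert e (0.0 if not routed)
    expert_out_norms[t][e] : L2 norm of expert e's output for token t
    padding_mask[t]        : True for real tokens, False for padding

    Returns, for each expert, the mean of gate_weight * output_norm over the
    non-padding tokens that were routed to it (0.0 if it was never routed).
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms, is_real in zip(gate_weights, expert_out_norms, padding_mask):
        if not is_real:
            continue  # padding tokens must not influence the score
        for expert, (gate, norm) in enumerate(zip(gates, norms)):
            if gate > 0.0:
                totals[expert] += gate * norm
                counts[expert] += 1
    return [total / count if count else 0.0 for total, count in zip(totals, counts)]


def experts_to_keep(saliency, prune_ratio):
    """Indices of the top-saliency experts left after pruning `prune_ratio`."""
    num_keep = max(1, round(len(saliency) * (1.0 - prune_ratio)))
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:num_keep])


# Toy layer: 4 experts, 3 tokens, the last token is padding.
gates = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0], [0.5, 0.0, 0.0, 0.5]]
norms = [[2.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.5, 0.0], [9.0, 0.0, 0.0, 9.0]]
mask = [True, True, False]

scores = reap_saliency(gates, norms, mask)
kept = experts_to_keep(scores, prune_ratio=0.5)  # keeps experts 0 and 1
```

Under the stated assumptions, `experts_to_keep` with 256 experts and `prune_ratio=0.77` keeps 59 experts per layer. The scores come from the calibration set, so the data used for calibration determines which experts survive.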