---
license: mit
base_model: zai-org/GLM-5.2
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
language: [en]
tags: [mlx, apple-silicon, moe, pruned, quantized, soul-targeted, agentic, local-agent, glm]
private: false
datasets: [philipjohnbasile/glm52-demolition-data]
model-index:
- name: GLM-5.2-Demolition-q4a4-soul-MLX
  results:
  - task: { type: text-generation, name: Code Generation }
    dataset: { name: HumanEval, type: openai_humaneval }
    metrics:
    - { type: pass@1, value: 69.0, name: "pass@1 (HumanEval-164, single-shot, verifier-scored)" }
---

# GLM-5.2-Demolition · q4a4-soul (v3)

A **demolition** of [`zai-org/GLM-5.2`](https://huggingface.co/zai-org/GLM-5.2) (743B total / 39B active MoE,
MIT) down to a **~98 GB 4-bit** model that loads and runs **fully on a single Apple M5 Max (128 GB)**. v3's
distinguishing move is **soul-targeted expert pruning** — the kept experts are chosen by saliency measured on
*our* facet data, not a generic corpus — plus a deliberately **pure vanilla-code core** with **swappable
heritage "souls"** mounted on demand.

## Architecture: a PURE core + swappable souls
- **CORE (always-on):** *just the vanilla languages, done excellently* — Python · TypeScript · JavaScript ·
  Rust · Go · HTML · CSS · SQL · Postgres. **No frameworks, no baked specialties.** Latest versions, vanilla
  stdlib.
- **SOULS (mount per-request — the model factory):** small LoRA adapters that *name a field's masters* to
  activate latent eliteness:
  - **art** (Basquiat/Haring/Banksy/Sol LeWitt/Casey Reas) · **music** (Bach/J Dilla/Eno/WALL-E) ·
    **design** (Rams/Bauhaus) · **perfumery** (Beaux/Guerlain/Ellena) · **science** (Feynman/Darwin/Sagan) ·
    **legacy** (K&R/Knuth/Dijkstra/Hopper) · **security** (Saltzer-Schroeder/Aleph-One, purple-team) ·
    **gamedev** (Carmack/Handmade-Hero, *vanilla from-scratch*) · **fullstack** (htmx/Go-stdlib/Postgres) ·
    math · dataviz · prose · architecture · research.

## The demolition lineage (honest)
| ver | prune | quant | size | result |
|-----|-------|-------|------|--------|
| v1  | keep 30% experts (generic calib) | 3-bit | 99 GB | broke — hallucinates, sentence-loops |
| v2  | keep 23% experts (code calib)    | 4-bit | 98 GB | design coherent; trivia gone (by design) |
| **v3** | keep 23% experts (**soul calib**) | **4-bit** | ~98 GB | coherent FOCUS-9 vanilla code (healed) |

**Why 4-bit, not 3:** 3-bit was just below the quality cliff; 4-bit is just above it *and* MLX's best-optimized
kernel (cleanest packing). 2-bit is worse. No bit-width of a demolished 744B beats a clean right-sized model —
this artifact is the **best-possible demolition**, a research result, not a frontier claim.

## Method
1. **Saliency** (`23_stream_calibrate`) on our facet corpus → score each routed expert.
2. **Prune** (`24_apply_prune --ratio 0.77`) → keep the top-saliency experts.
3. **Re-quantize** (`24b_stream_requantize --bits 4`) → uniform 4-bit experts, 4-bit attn, 6-bit head.
4. **Heal** (`06_heal_lora`) — LoRA on vanilla FOCUS-9 gold; souls heal separately per facet.

## Honest scope
- **Speed:** ~10 tok/s — memory-bandwidth-bound (inherent to a 98 GB model on M5; spec-decode nets only
  ~1.05× here, so it's not used).
- **Strengths:** the FOCUS-9 vanilla languages + whichever soul is mounted. **Not** general trivia — those
  experts were deliberately pruned. Best driven by a verifier-first agent (the compiler steers each line).
- **Eval:** **HumanEval-164 pass@1 = 114/164 (69%)** — full set, single-shot, scored on hidden tests by real
  verifiers (the easy n=20 subset was 95%). Mid-tier-usable for a demolished 4-bit model on a laptop
  (≈ GPT-4-class on this metric; below dedicated frontier coders ~90%); a verifier-first agent loop lifts it
  further. Strong on writing vanilla FOCUS-9 functions from a spec; weaker on hard debugging/multi-step and
  off-distribution prompts. **Honest scope:** this is the *best-possible demolition* of a 744B model, a research
  artifact — not a frontier daily-driver. See the full measured write-up in the repo's `MISSION_SUMMARY.md`.

Built with the open pipeline at `glm52-demolition`. **Public** (MIT — GLM-5.2 is Z.ai Pure-Open).