GLM-5.2-Demolition · q4a4-soul (v3)

A demolition of zai-org/GLM-5.2 (743B total / 39B active MoE, MIT) down to a ~98 GB 4-bit model that loads and runs fully on a single Apple M5 Max (128 GB). v3's distinguishing move is soul-targeted expert pruning — the kept experts are chosen by saliency measured on our facet data, not a generic corpus — plus a deliberately pure vanilla-code core with swappable heritage "souls" mounted on demand.

Architecture: a PURE core + swappable souls

  • CORE (always-on): just the vanilla languages, done excellently — Python · TypeScript · JavaScript · Rust · Go · HTML · CSS · SQL · Postgres. No frameworks, no baked specialties. Latest versions, vanilla stdlib.
  • SOULS (mount per-request — the model factory): small LoRA adapters that name a field's masters to activate latent eliteness:
    • art (Basquiat/Haring/Banksy/Sol LeWitt/Casey Reas) · music (Bach/J Dilla/Eno/WALL-E) · design (Rams/Bauhaus) · perfumery (Beaux/Guerlain/Ellena) · science (Feynman/Darwin/Sagan) · legacy (K&R/Knuth/Dijkstra/Hopper) · security (Saltzer-Schroeder/Aleph-One, purple-team) · gamedev (Carmack/Handmade-Hero, vanilla from-scratch) · fullstack (htmx/Go-stdlib/Postgres) · math · dataviz · prose · architecture · research.

The demolition lineage (honest)

ver prune quant size result
v1 keep 30% experts (generic calib) 3-bit 99 GB broke — hallucinates, sentence-loops
v2 keep 23% experts (code calib) 4-bit 98 GB design coherent; trivia gone (by design)
v3 keep 23% experts (soul calib) 4-bit ~98 GB coherent FOCUS-9 vanilla code (healed)

Why 4-bit, not 3: 3-bit was just below the quality cliff; 4-bit is just above it and MLX's best-optimized kernel (cleanest packing). 2-bit is worse. No bit-width of a demolished 744B beats a clean right-sized model — this artifact is the best-possible demolition, a research result, not a frontier claim.

Method

  1. Saliency (23_stream_calibrate) on our facet corpus → score each routed expert.
  2. Prune (24_apply_prune --ratio 0.77) → keep the top-saliency experts.
  3. Re-quantize (24b_stream_requantize --bits 4) → uniform 4-bit experts, 4-bit attn, 6-bit head.
  4. Heal (06_heal_lora) — LoRA on vanilla FOCUS-9 gold; souls heal separately per facet.

Honest scope

  • Speed: ~10 tok/s — memory-bandwidth-bound (inherent to a 98 GB model on M5; spec-decode nets only ~1.05× here, so it's not used).
  • Strengths: the FOCUS-9 vanilla languages + whichever soul is mounted. Not general trivia — those experts were deliberately pruned. Best driven by a verifier-first agent (the compiler steers each line).
  • Eval: HumanEval-164 pass@1 = 114/164 (69%) — full set, single-shot, scored on hidden tests by real verifiers (the easy n=20 subset was 95%). Mid-tier-usable for a demolished 4-bit model on a laptop (≈ GPT-4-class on this metric; below dedicated frontier coders ~90%); a verifier-first agent loop lifts it further. Strong on writing vanilla FOCUS-9 functions from a spec; weaker on hard debugging/multi-step and off-distribution prompts. Honest scope: this is the best-possible demolition of a 744B model, a research artifact — not a frontier daily-driver. See the full measured write-up in the repo's MISSION_SUMMARY.md.

Built with the open pipeline at glm52-demolition. Public (MIT — GLM-5.2 is Z.ai Pure-Open).

Downloads last month
2,819
Safetensors
Model size
29B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Base model

zai-org/GLM-5.2
Quantized
(74)
this model

Dataset used to train philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Evaluation results

  • pass@1 (HumanEval-164, single-shot, verifier-scored) on HumanEval
    self-reported
    69.000