philipjohnbasile commited on
Commit
276c029
Β·
verified Β·
1 Parent(s): dcd7cd6

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +37 -111
README.md CHANGED
@@ -1,116 +1,42 @@
1
  ---
2
  license: mit
3
  base_model: zai-org/GLM-5.2
4
- library_name: mlx
5
- pipeline_tag: text-generation
6
- language: [en]
7
- tags: [mlx, moe, code, agentic, glm, pruned, quantized, verified-decoding, apple-silicon, local-agent]
8
  ---
9
 
10
- # GLM-5.2-Demolition β€” a 743B frontier MoE, demolished to run on a 128 GB Mac
11
-
12
- **One line:** we took `zai-org/GLM-5.2` (743B-parameter Mixture-of-Experts, ~381 GB at 4-bit) and
13
- demolished it to **99 GB** so it runs **fully on-device on a MacBook Pro M5 Max (128 GB)** β€” then
14
- healed it and wrapped it in a **47-tool local agent** that does things a cloud model structurally
15
- cannot: the **compiler steers every line it writes**, it **can't fake a passing test or leak a
16
- secret**, and it can be **fine-tuned on *your* private repo** so it writes in your style.
17
-
18
- A **niche specialist**, not a general model β€” tuned to beat a frontier model *in one lane* (agentic
19
- coding + design for **TS/JS/Python/Rust/Go/HTML/CSS** + Postgres) by out-*verifying* it, not out-knowing it.
20
-
21
- ## How it was made
22
- 1. **Pruned** the MoE experts 256 β†’ 77 by **router-weighted saliency (REAP** = `router_weight Γ—
23
- activation_norm`, padding-masked), streaming layer-by-layer (~5 GB working set β€” it never fits in RAM).
24
- 2. **Quantized** mixed-precision (MLX): experts **3-bit**, attention/embeddings/lm_head **4-bit** β†’ **99 GB**.
25
- 3. **Healed** with **LoRA SFT** (`--no-mask-prompt`, grad-checkpointed). The current **v4** rebuild uses a
26
- **code-first balanced calibration** (so the *math* super-experts survive the prune β€” v3's coding-only
27
- calibration collapsed math) + heal/distill on **R1 long-CoT reasoning traces**. Router-KD / expert-wise
28
- Logit-KD are research-validated recovery stages (optional). *(GRPO/RLVR was tried and regressed β†’ SFT.)*
29
-
30
- ## What makes it different (built + selftested)
31
- - **Verified decoding (compiler-steered):** generates line-by-line while the **real type-checker runs in
32
- the loop**; a line that adds an error is backtracked. TS 0.3 ms Β· Python ~0 ms Β· Rust 34 ms per check.
33
- Practical *only* on Apple Silicon β€” unified memory lets the model (GPU) and compiler (CPU) share RAM.
34
- - **The verifier mesh:** every output meets its real tool β€” compile+run+**idiomatic lint** (clippy/ruff/
35
- gofmt/prettier) for 5 langs, **SQL** (sqlite), **math** (SymPy), **proofs** (**Lean 4**), design (render+see).
36
- - **A 47-tool agent** with **five defense layers** the frontier lacks out of the box:
37
- **trust** (checkpoint/rollback, secret-scan, prompt-injection guard, audit, risk-gate),
38
- **reliability** (constraint-pinning vs context-rot, false-success guard, flaky-test re-run, onboarding map),
39
- **self-improvement** (skill library, large-output pointers, clarify-before-assuming),
40
- **integrity** (test-tamper guard, fabrication-proof `done`, scope enforcement, slopsquat guard),
41
- plus a **humanizer** (kills AI-slop, matches your voice).
42
- - **Own your repo:** `scripts/64_own_your_repo.py` fine-tunes the model on *your* private codebase so it
43
- writes in your style β€” a cloud flagship can't be tuned on your private code.
44
- - **Design soul** (render-and-measure critic: WCAG/type-scale/OKLCH), **CallSieve** zero-token retrieval +
45
- live-docs RAG, **vision/voice/video** (all MLX), code-rendered math/arch figures (matplotlib/manim/TikZ).
46
-
47
- ## Every chip on the M5 Max, working
48
- The agent spreads perception, verification, and dispatch across **all six compute blocks** so the GPU stays
49
- free for token generation (built + selftested):
50
- - **GPU** (40-core + M5 Neural Accelerators) β€” the 99 GB model decodes + LoRA-heals.
51
- - **Neural Engine** (16-core) β€” embeddings Β· OCR Β· image segmentation / pose / object-detection Β· NER+POS Β·
52
- audio classification + VAD Β· neural TTS Β· zero-shot routing Β· rerank β€” all via Apple frameworks, no CoreML, no GPU.
53
- - **18 CPU cores** β€” the verifier mesh fanned out (`verify_many`, 6.6Γ—) Β· 9-language compile-verify Β· tabular ML.
54
- - **Media Engine** β€” hardware H.264/HEVC/AV1 decode + encode for the video lane.
55
- - **AMX/SME** β€” matrix coprocessor via Accelerate (~2.1 TFLOP/s f32), implicit in every numpy op.
56
- - **ASR** = **Whisper on MLX** (no mic-permission needed). An **Any-to-Any omni-router** sends any input
57
- (text / image / audio / video / table) to its optimal block.
58
-
59
- ## The model factory (swappable domain souls)
60
- One 99 GB base + hot-swappable LoRA "souls" (~100 MB each) β€” change the model's specialty by swapping the
61
- adapter: **code Β· design Β· agentic Β· gamedev Β· legacy/enterprise Β· security Β· fullstack Β· science Β· data Β·
62
- perfumery**. Each is healed from the same base by an autonomous chain that forges the whole library overnight
63
- on the one Mac β€” and a `factory`-dispatcher soul makes the model route requests to the right specialty itself.
64
-
65
- ## Requirements
66
- - **Apple Silicon, 128 GB** unified memory (M5-class recommended), macOS 26/27+. **MLX β‰₯ 0.31.**
67
- - The architecture (`glm_moe_dsa`: MLA + DSA sparse attention) needs the **bundled patches** β€” stock
68
- mlx_lm can't load it.
69
- - **⚠️ Raise the GPU memory ceiling β€” required.** The model needs ~101.6 GB; macOS caps the GPU
70
- working set at ~110 GB by default, so it OOM-crashes (Metal command-buffer timeout) on long
71
- generations. Fix before serving:
72
- ```bash
73
- sudo sysctl iogpu.wired_limit_mb=122000 # 122 GB; one-shot (resets on reboot)
74
- sudo bash dist/install_gpu_limit.sh # OR: persist it via a LaunchDaemon
75
- ```
76
- Without this the model appears to "randomly crash" β€” it's just memory-starved.
77
-
78
- ## Use it
79
- ```bash
80
- python dist/install_glm_dsa_patch.py # patch mlx_lm (venv AND LM Studio's bundled engine)
81
- GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
82
- --adapter-path heal/adapters-v4 # serve (OpenAI-compatible); v2 + heal/adapters also ship
83
- # drive the 47-tool agent on your repo:
84
- python scripts/57_tool_agent.py --repo /path/to/your/repo --apply --task "..." --test "cargo test"
85
- # speed: try --dsa-block-size 32/64/128 (free, pick fastest). External draft is Metal-unstable here; MTP self-spec is the real path.
86
- ```
87
- In **LM Studio**: run the patch, fully quit + reopen, then load the model.
88
-
89
- ## Performance (M5 Max 128 GB, v4)
90
- | Metric | Value |
91
- |---|---|
92
- | Size | 99 GB (from 381 GB mxfp4 / ~1.5 TB bf16) |
93
- | HumanEval pass@1 | **19/20 (95%)**, single-shot |
94
- | Math GSM8K | **8/12 (66%)** β€” recovered from v3's **0/5** (code-first balanced calibration kept the math super-experts alive through the prune) |
95
- | Algebra (SymPy-checked) | **3/4 (75%)** |
96
- | Decode speed | **11.3 tok/s** (no draft) β€” see the speed note in limitations |
97
- | Verified-decode checker | TS 0.3 ms Β· Python ~0 ms Β· Rust 34 ms |
98
-
99
- ## Honest limitations
100
- - **Specialist:** ~70% of experts pruned β€” strong in the target niche, weaker outside it. Not the full 743B.
101
- - **Speed ~11 tok/s decode** (reading pace; ~3 min for long thinking-ON answers). Partly MLX's still-naive
102
- **DSA attention kernels** (mlx #837 / #3402 β€” *improves for free* as MLX matures), partly the bandwidth
103
- cost of a 743B-class MoE on a laptop. **Measured dead-ends** (don't bother): 4-bit re-quant is *slower*
104
- for single-token decode (bandwidth-bound, smaller wins); active-experts 8β†’4 gives no win at batch=1.
105
- **Real path:** `--dsa-block-size` sweep (free) β†’ upstream MLX β†’ **MTP self-speculative** (~2.6Γ—, a port
106
- for this arch). Not a quant change.
107
- - **Multilingual** ability reduced (optional vocab-trim drops ~31% of tokens).
108
- - **Design** is competent but not yet design-soul-elite (correct structure, but missed OKLCH/grid when
109
- tested) β€” the design-canon heal closes this.
110
- - Prompt-cache can OOM under heavy concurrent load. The external speculative draft is **Metal-unstable**
111
- on this MoE β€” **MTP self-speculative is the right path**; the external draft is not recommended.
112
-
113
- ## Attribution & license
114
- **MIT.** Base model Β© **Z.ai** (`zai-org/GLM-5.2`, MIT-licensed) β€” so this derivative is MIT too: free
115
- to use, modify, and redistribute **with attribution to Z.ai**. The demolition / healing / 47-tool agent
116
- tooling is this repo's contribution.
 
1
  ---
2
  license: mit
3
  base_model: zai-org/GLM-5.2
4
+ tags: [mlx, apple-silicon, moe, pruned, quantized, soul-targeted]
5
+ private: true
 
 
6
  ---
7
 
8
+ # GLM-5.2-Demolition q4a4-v3 (soul-targeted)
9
+
10
+ A demolition of **GLM-5.2** (743B total / 39B active MoE, MIT) down to a **~98 GB 4-bit** model that
11
+ loads on a single **M5 Max 128 GB**. v3's distinguishing move: **soul-targeted expert pruning** β€” the
12
+ kept experts are chosen by saliency measured on *our* facet data (code, design, math, security, gamedev,
13
+ agentic, retrieval), not a generic corpus.
14
+
15
+ ## The demolition lineage (honest)
16
+ | ver | prune | quant | size | result |
17
+ |-----|-------|-------|------|--------|
18
+ | v1 | keep 30% experts (generic) | 3-bit | 99 GB | broken β€” hallucinates, sentence-loops |
19
+ | v2 | keep 23% experts (code-calib) | 4-bit | 98 GB | design coherent; trivia gone (expected) |
20
+ | **v3** | keep 23% experts (**soul-calib**) | 4-bit | ~98 GB | _measured: TBD β€” see Eval_ |
21
+
22
+ ## Method
23
+ 1. **Saliency** (`23_stream_calibrate`) on `soul_calib_v3.jsonl` (1465 facet sequences) β€” streams the
24
+ 381 GB mxfp4 original layer-by-layer, scores each routed expert by activation on our souls.
25
+ 2. **Prune** (`24_apply_prune --ratio 0.77`) β€” keep the top-saliency experts per MoE layer.
26
+ 3. **Re-quantize** (`24b_stream_requantize --bits 4`) β€” uniform 4-bit experts, 4-bit attn, 6-bit head.
27
+ 4. **Heal** (`06_heal_lora`) β€” LoRA on the soul heal set (gold + design + flywheel verified-fixes).
28
+
29
+ ## Honest scope
30
+ - **Speed:** ~10 tok/s (M5 memory-bandwidth bound β€” inherent to a 98 GB model).
31
+ - **Strengths:** our facets (code, design, security, math). **Not** general trivia β€” those experts were
32
+ deliberately pruned. Perfumery questions will fail *by design*.
33
+ - **Reality check:** a clean right-sized model (Qwen3-Coder ~30B @4-bit) is faster *and* broader. This
34
+ artifact is the **best-possible demolition** of a 744B giant onto a laptop β€” a research result, not a
35
+ daily driver. The methods (REAP saliency, soul-targeting, the heal recipe) are the transferable value.
36
+
37
+ ## Eval (filled after measurement)
38
+ - design / code / security / math facet probes: _TBD_
39
+ - v3 (soul) vs v2 (code) on our facets: _TBD_
40
+ - tok/s: _TBD_
41
+
42
+ Built with the open pipeline at `glm52-demolition` (scripts 23/24/24b/06). Private until release (MIT).