---
language:
- en
- zh
license: mit
library_name: mlx
pipeline_tag: text-generation
tags:
- glm_moe_dsa
- quantized
- glm
- moe
- apple-silicon
- mixed-precision
- 2-bit
- conversational
base_model: zai-org/GLM-5.2
---

# GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Apple Silicon (MLX) **mixed-precision** quantization of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2), a 744B-parameter (\~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, `glm_moe_dsa`). Quantized to **\~2.56 bits/weight** so the full model runs in **≤256 GB of unified memory**.

> ⚠️ **Requires a patched `mlx-lm`** with the `glm_moe_dsa` indexer fixes (see *Correctness* below). The stock port is incomplete for GLM-5.2; loading there fails or degrades long-context output.

## Metrics

| Metric | Value |
|---|---|
| Base model | zai-org/GLM-5.2 (744B total / \~40B active) |
| Bits/weight | **\~2.56** (per-tensor mixed) |
| On-disk size | **237.9 GB** (46 shards) |
| Peak memory | \~238 GB (short ctx) · \~245 GB (8K ctx) |
| Format | MLX (Apple Silicon) |
| Context | up to 1M tokens (DSA sparse attention) |

## Why this model

GLM-5.2 is a frontier agentic-coding MoE, but at 744B parameters it is \~1.5 TB in bf16, out of reach for consumer memory, and existing MLX builds start at \~360 GB (≥4-bit, 512 GB-class machines). This build uses **Unsloth-style per-tensor mixed precision**: the routed experts (\~97% of params) go to 2-bit while the sensitive paths keep higher precision, landing **under 256 GB** while preserving long-context retrieval and coding quality.

![On-disk footprint across GLM-5.2 MLX builds: this 2.56 bpw build (238 GB) is the only one that fits a ≤256 GB machine; golden 328, mixed-3_6 360, Q4.8-INF 447, DQ4plus 465 all need 512 GB-class](assets/field.png)

## Quality

This is the **≤256 GB** option: the routed experts are 2-bit, so it is deliberately bit-starved.
If you have a 512 GB machine, the **[3.5 bpw build](https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw)** is materially better (−32% wikitext PPL, −14% code) and still runs a full 1M context.

![Perplexity: this 2.56 bpw build vs the 3.5 bpw build; 2.56 bpw is 32% higher (worse) on wikitext, 14% on code](assets/quality.png)

*Strided perplexity from a fixed local harness: relative numbers for comparing these two builds, not directly comparable to perplexities other quantizers report on different corpora.*

## Benchmarks

Reproduced with `mlx_lm.evaluate` (0-shot) and `mlx_lm.perplexity` (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

| Metric | GLM-5.1 · 2.7 bpw | **GLM-5.2 · 2.56 bpw (this)** | GLM-5.2 · 3.5 bpw |
|---|---|---|---|
| Perplexity (lower is better) | 4.165 | **3.850** | 3.766 |
| HellaSwag (acc_norm) | 0.606 | **0.636** | 0.610 |
| PIQA (acc) | 0.796 | **0.796** | 0.828 |
| WinoGrande (acc) | 0.660 | **0.708** | 0.766 |
| Generation (tok/s) | 18.35 | **22.87** | 21.29 |

Perplexity here is on `allenai/tulu-3-sft-mixture` (the `mlx_lm.perplexity` default), a different corpus and method from the wikitext strided figure above, so values are not comparable across the two. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.
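
The ±0.02–0.04 interval quoted for the 500-sample task accuracies can be roughly reproduced with a binomial approximation. The sketch below is an assumption about how such an interval could be estimated (a normal-approximation interval, which may not be what the evaluation harness itself reports); `binomial_ci_halfwidth` is a hypothetical helper and not part of `mlx_lm`.

```python
import math


def binomial_ci_halfwidth(acc: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for an accuracy over n samples."""
    return z * math.sqrt(acc * (1.0 - acc) / n)


# Accuracies of this build, copied from the table above; 500 is the sample limit.
scores = {"HellaSwag": 0.636, "PIQA": 0.796, "WinoGrande": 0.708}

for task, acc in scores.items():
    one_se = binomial_ci_halfwidth(acc, 500, z=1.0)  # one standard error
    ci95 = binomial_ci_halfwidth(acc, 500)           # roughly a 95% interval
    print(f"{task}: {acc:.3f}  +/-{one_se:.3f} (1 SE)  +/-{ci95:.3f} (95%)")
```

Under this approximation the half-widths fall between about 0.018 (one standard error) and 0.042 (95%), so gaps of a few hundredths between columns, such as the HellaSwag difference between the two GLM-5.2 builds, may be within sampling noise.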

## Quantization recipe

![Mixed-precision recipe: experts 2-bit, MLA/shared/dense 4-bit, embed/head 6-bit, router bf16, indexer fp16](assets/recipe.png)

| Component | Bits | Notes |
|---|---|---|
| Routed experts (gate/up/down) | 2-bit g64 | \~96% of params, the bulk |
| MLA attn · shared experts · dense MLP | 4-bit g64 | per-token critical path |
| Token embedding · LM head | 6-bit g64 | distribution-sensitive |
| Router (`mlp.gate`) | bf16 | drives discrete top-8 routing |
| DSA lightning indexer | fp16 | drives discrete top-k selection |

## Correctness (verified vs the HF reference)

GLM-5.2's `glm_moe_dsa` needed fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

- **IndexShare:** the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (`index_topk_freq=4`). The stock port built an indexer on every layer, which led to missing weights or wrong output beyond 2,048 tokens.
- **Indexer RoPE/eps:** the indexer uses **non-interleaved (half-split) RoPE + LayerNorm eps 1e-6**, distinct from the interleaved main attention. Post-RoPE `q` matches the HF reference to \~1e-7. Recorded in `config.json` (`indexer_rope_traditional=false`, `indexer_norm_eps=1e-6`).

**Validation:** full-attention logits match the HF reference to float precision at ≤index_topk context; **needle retrieval succeeds through a 7,586-token prompt** (sparse-DSA regime); coherent code generation; peak ≤256 GB.

## Usage

```bash
# requires mlx-lm with the glm_moe_dsa indexer fixes
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw \
  --prompt "Write a quicksort in Python."

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw
```

## Hardware

Runs in **≤256 GB unified memory** (Apple Silicon). On a 256 GB box the 238 GB of weights leave only \~18 GB for KV + OS (short/mid context); on a 512 GB M3 Ultra there is ample room for a long-context KV cache.
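
As a rough cross-check of the recipe and memory figures above, the average bits per weight and the remaining headroom can be estimated with simple arithmetic. This is a sketch under stated assumptions, not the author's accounting: it assumes each group of 64 weights also stores a 16-bit scale and a 16-bit bias (0.5 extra bits per weight), lumps every non-expert tensor into the 4-bit bucket, and takes the expert share from the approximate percentages quoted in this card. All helper names are hypothetical.

```python
def effective_bits(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Bits per weight including the per-group scale and bias (assumed 16 bits each)."""
    return bits + overhead_bits / group_size


def average_bpw(expert_share: float) -> float:
    """Two-bucket estimate: routed experts at 2-bit, everything else at 4-bit."""
    return expert_share * effective_bits(2) + (1.0 - expert_share) * effective_bits(4)


def size_gb(params: float, bpw: float) -> float:
    """Weight footprint in decimal gigabytes."""
    return params * bpw / 8 / 1e9


PARAMS = 744e9  # total parameter count stated in this card

# The card quotes both ~96% and ~97% for the routed-expert share, so try both.
for share in (0.96, 0.97):
    bpw = average_bpw(share)
    print(f"expert share {share:.0%}: ~{bpw:.2f} bpw, ~{size_gb(PARAMS, bpw):.0f} GB")

# Headroom left for the KV cache and the OS at the reported 2.56 bpw.
weights_gb = size_gb(PARAMS, 2.56)
for machine_gb in (256, 512):
    print(f"{machine_gb} GB machine: ~{machine_gb - weights_gb:.0f} GB free")
```

With a 97% expert share the estimate lands at about 2.56 bpw and 238 GB, in line with the reported figures. It ignores the 6-bit embeddings and the unquantized router and indexer, so it is a sanity check rather than an exact accounting.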

![Memory headroom: 238 GB weights are tight on a 256 GB machine (\~18 GB free, short/mid context) but roomy on 512 GB (\~274 GB free, 1M context)](assets/memory.png)

## Credits

- Base model: **Zhipu / Z.ai, GLM-5.2** (MIT).
- **MLX** & **mlx-lm**: Apple ml-explore.
- Mixed-precision quantization + `glm_moe_dsa` correctness fixes: **Alis (avlp12)**.

## Citation

> **Alis (avlp12)** (2026). *GLM-5.2-Alis-MLX-Dynamic-2.56bpw*: 2.56 bpw MLX quantization of [GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) for ≤256 GB Apple Silicon.