---
language:
- en
- zh
license: mit
library_name: mlx
pipeline_tag: text-generation
tags:
- glm_moe_dsa
- quantized
- glm
- moe
- apple-silicon
- mixed-precision
- 2-bit
- conversational
base_model: zai-org/GLM-5.2
---

# GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Apple Silicon (MLX) **mixed-precision** quantization of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) — a 744B-parameter (\~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, `glm_moe_dsa`). Quantized to **\~2.56 bits/weight** so the full model runs in **≤256 GB of unified memory**.

> ⚠️ **Requires a patched `mlx-lm`** with the `glm_moe_dsa` indexer fixes (see *Correctness* below). The stock port is incomplete for GLM-5.2; loading there fails or degrades long-context output.

## Metrics

| | |
|---|---|
| Base model | zai-org/GLM-5.2 (744B total / \~40B active) |
| Bits/weight | **\~2.56** (per-tensor mixed) |
| On-disk size | **237.9 GB** (46 shards) |
| Peak memory | \~238 GB (short ctx) · \~245 GB (8K ctx) |
| Format | MLX (Apple Silicon) |
| Context | up to 1M tokens (DSA sparse attention) |

## Why this model

GLM-5.2 is a frontier agentic-coding MoE, but at 744B it is \~1.5 TB in bf16 — out of reach for consumer memory, and existing MLX builds start at \~360 GB (≥4-bit, 512 GB-class machines). This build uses **Unsloth-style per-tensor mixed precision**: the routed experts (\~97% of params) go to 2-bit while the sensitive paths keep higher precision, landing **under 256 GB** while preserving long-context retrieval and coding quality.

![On-disk footprint across GLM-5.2 MLX builds: this 2.56 bpw build (238 GB) is the only one that fits a ≤256 GB machine; golden 328, mixed-3_6 360, Q4.8-INF 447, DQ4plus 465 all need 512 GB-class](assets/field.png)

## Quality

This is the **≤256 GB** option — the routed experts are 2-bit, so it is deliberately bit-starved. If you have a 512 GB machine, the **[3.5 bpw build](https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw)** is materially better (−32% wikitext PPL, −14% code) and still runs a full 1M context.

![Perplexity: this 2.56 bpw build vs the 3.5 bpw build — 2.56 bpw is 32% higher (worse) on wikitext, 14% on code](assets/quality.png)

*Strided perplexity from a fixed local harness — relative numbers for comparing these two builds, not directly comparable to perplexities other quantizers report on different corpora.*

## Benchmarks

Reproduced with `mlx_lm.evaluate` (0-shot) and `mlx_lm.perplexity` (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

| | GLM-5.1 · 2.7 bpw | **GLM-5.2 · 2.56 bpw (this)** | GLM-5.2 · 3.5 bpw |
|---|---|---|---|
| Perplexity (lower) | 4.165 | **3.850** | 3.766 |
| HellaSwag (acc_norm) | 0.606 | **0.636** | 0.610 |
| PIQA (acc) | 0.796 | **0.796** | 0.828 |
| WinoGrande (acc) | 0.660 | **0.708** | 0.766 |
| Generation (tok/s) | 18.35 | **22.87** | 21.29 |

Perplexity here is on `allenai/tulu-3-sft-mixture` (the `mlx_lm.perplexity` default) — a different corpus and method from the wikitext strided figure above, so values are not comparable across the two. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.

**Quantization recipe**

![Mixed-precision recipe: experts 2-bit, MLA/shared/dense 4-bit, embed/head 6-bit, router bf16, indexer fp16](assets/recipe.png)

| Component | Bits | Notes |
|---|---|---|
| Routed experts (gate/up/down) | 2-bit g64 | \~96% of params — the bulk |
| MLA attn · shared experts · dense MLP | 4-bit g64 | per-token critical path |
| Token embedding · LM head | 6-bit g64 | distribution-sensitive |
| Router (`mlp.gate`) | bf16 | drives discrete top-8 routing |
| DSA lightning indexer | fp16 | drives discrete top-k selection |

## Correctness (verified vs the HF reference)

GLM-5.2's `glm_moe_dsa` needed fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

- **IndexShare** — the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (`index_topk_freq=4`). The stock port built an indexer on every layer → missing-weights / wrong >2048-token output.
- **Indexer RoPE/eps** — the indexer uses **non-interleaved (half-split) RoPE + LayerNorm eps 1e-6**, distinct from the interleaved main attention. Post-RoPE `q` matches the HF reference to \~1e-7. Recorded in `config.json` (`indexer_rope_traditional=false`, `indexer_norm_eps=1e-6`).

**Validation:** full-attention logits match the HF reference to float precision at ≤index_topk context; **needle retrieval succeeds through a 7,586-token prompt** (sparse-DSA regime); coherent code generation; peak ≤256 GB.

## Usage

```bash
# requires mlx-lm with the glm_moe_dsa indexer fixes
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw \
  --prompt "Write a quicksort in Python."

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw
```

## Hardware

Runs in **≤256 GB unified memory** (Apple Silicon). On a 256 GB box the 238 GB of weights leave only \~18 GB for KV + OS (short/mid context); on a 512 GB M3 Ultra there is ample room for a long-context KV cache.

![Memory headroom: 238 GB weights are tight on a 256 GB machine (\~18 GB free, short/mid context) but roomy on 512 GB (\~274 GB free, 1M context)](assets/memory.png)

## Credits

- Base model: **Zhipu / Z.ai — GLM-5.2** (MIT).
- **MLX** & **mlx-lm**: Apple ml-explore.
- Mixed-precision quantization + `glm_moe_dsa` correctness fixes: **Alis (avlp12)**.

## Citation

> **Alis (avlp12)** (2026). *GLM-5.2-Alis-MLX-Dynamic-2.56bpw* — 2.56 bpw MLX quantization of [GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) for ≤256 GB Apple Silicon.
> <https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw>