How to use from
Hermes Agent
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw
Run Hermes
hermes
Quick Links

GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa). Quantized to ~2.56 bits/weight so the full model runs in ≤256 GB of unified memory.

⚠️ Requires a patched mlx-lm with the glm_moe_dsa indexer fixes (see Correctness below). The stock port is incomplete for GLM-5.2; loading there fails or degrades long-context output.

Metrics

Base model zai-org/GLM-5.2 (744B total / ~40B active)
Bits/weight ~2.56 (per-tensor mixed)
On-disk size 237.9 GB (46 shards)
Peak memory ~238 GB (short ctx) · ~245 GB (8K ctx)
Format MLX (Apple Silicon)
Context up to 1M tokens (DSA sparse attention)

Why this model

GLM-5.2 is a frontier agentic-coding MoE, but at 744B it is ~1.5 TB in bf16 — out of reach for consumer memory, and existing MLX builds start at ~360 GB (≥4-bit, 512 GB-class machines). This build uses Unsloth-style per-tensor mixed precision: the routed experts (~97% of params) go to 2-bit while the sensitive paths keep higher precision, landing under 256 GB while preserving long-context retrieval and coding quality.

On-disk footprint across GLM-5.2 MLX builds: this 2.56 bpw build (238 GB) is the only one that fits a ≤256 GB machine; golden 328, mixed-3_6 360, Q4.8-INF 447, DQ4plus 465 all need 512 GB-class

Quality

This is the ≤256 GB option — the routed experts are 2-bit, so it is deliberately bit-starved. If you have a 512 GB machine, the 3.5 bpw build is materially better (−32% wikitext PPL, −14% code) and still runs a full 1M context.

Perplexity: this 2.56 bpw build vs the 3.5 bpw build — 2.56 bpw is 32% higher (worse) on wikitext, 14% on code

Strided perplexity from a fixed local harness — relative numbers for comparing these two builds, not directly comparable to perplexities other quantizers report on different corpora.

Benchmarks

Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

GLM-5.1 · 2.7 bpw GLM-5.2 · 2.56 bpw (this) GLM-5.2 · 3.5 bpw
Perplexity (lower) 4.165 3.850 3.766
HellaSwag (acc_norm) 0.606 0.636 0.610
PIQA (acc) 0.796 0.796 0.828
WinoGrande (acc) 0.660 0.708 0.766
Generation (tok/s) 18.35 22.87 21.29

Perplexity here is on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default) — a different corpus and method from the wikitext strided figure above, so values are not comparable across the two. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.

Quantization recipe

Mixed-precision recipe: experts 2-bit, MLA/shared/dense 4-bit, embed/head 6-bit, router bf16, indexer fp16

Component Bits Notes
Routed experts (gate/up/down) 2-bit g64 ~96% of params — the bulk
MLA attn · shared experts · dense MLP 4-bit g64 per-token critical path
Token embedding · LM head 6-bit g64 distribution-sensitive
Router (mlp.gate) bf16 drives discrete top-8 routing
DSA lightning indexer fp16 drives discrete top-k selection

Correctness (verified vs the HF reference)

GLM-5.2's glm_moe_dsa needed fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

  • IndexShare — the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (index_topk_freq=4). The stock port built an indexer on every layer → missing-weights / wrong >2048-token output.
  • Indexer RoPE/eps — the indexer uses non-interleaved (half-split) RoPE + LayerNorm eps 1e-6, distinct from the interleaved main attention. Post-RoPE q matches the HF reference to ~1e-7. Recorded in config.json (indexer_rope_traditional=false, indexer_norm_eps=1e-6).

Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; needle retrieval succeeds through a 7,586-token prompt (sparse-DSA regime); coherent code generation; peak ≤256 GB.

Usage

# requires mlx-lm with the glm_moe_dsa indexer fixes
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw \
  --prompt "Write a quicksort in Python."

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Hardware

Runs in ≤256 GB unified memory (Apple Silicon). On a 256 GB box the 238 GB of weights leave only ~18 GB for KV + OS (short/mid context); on a 512 GB M3 Ultra there is ample room for a long-context KV cache.

Memory headroom: 238 GB weights are tight on a 256 GB machine (\~18 GB free, short/mid context) but roomy on 512 GB (\~274 GB free, 1M context)

Credits

  • Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
  • MLX & mlx-lm: Apple ml-explore.
  • Mixed-precision quantization + glm_moe_dsa correctness fixes: Alis (avlp12).

Citation

Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-2.56bpw — 2.56 bpw MLX quantization of GLM-5.2 for ≤256 GB Apple Silicon. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Downloads last month
2,544
Safetensors
Model size
743B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Base model

zai-org/GLM-5.2
Quantized
(74)
this model