Instructions to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw
Run Hermes
hermes
- MLX LM
How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
GLM-5.2-Alis-MLX-Dynamic-2.56bpw
Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa). Quantized to ~2.56 bits/weight so the full model runs in ≤256 GB of unified memory.
⚠️ Requires a patched mlx-lm with the glm_moe_dsa indexer fixes (see Correctness below). The stock port is incomplete for GLM-5.2; loading there fails or degrades long-context output.
Metrics
| Metric | Value |
|---|---|
| Base model | zai-org/GLM-5.2 (744B total / ~40B active) |
| Bits/weight | ~2.56 (per-tensor mixed) |
| On-disk size | 237.9 GB (46 shards) |
| Peak memory | ~238 GB (short ctx) · ~245 GB (8K ctx) |
| Format | MLX (Apple Silicon) |
| Context | up to 1M tokens (DSA sparse attention) |
Why this model
GLM-5.2 is a frontier agentic-coding MoE, but at 744B it is ~1.5 TB in bf16 — out of reach for consumer memory, and existing MLX builds start at ~360 GB (≥4-bit, 512 GB-class machines). This build uses Unsloth-style per-tensor mixed precision: the routed experts (~97% of params) go to 2-bit while the sensitive paths keep higher precision, landing under 256 GB while preserving long-context retrieval and coding quality.
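The size figures above follow from simple arithmetic on the parameter count. A minimal sketch, assuming decimal gigabytes and ignoring any non-weight files in the repository; the helper name `weights_size_gb` is illustrative and not part of mlx-lm:

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 744e9  # total parameters, as stated in this card

bf16_gb = weights_size_gb(N_PARAMS, 16)     # unquantized bf16 checkpoint
quant_gb = weights_size_gb(N_PARAMS, 2.56)  # this build's average bits/weight

print(f"bf16:     {bf16_gb:.0f} GB")
print(f"2.56 bpw: {quant_gb:.1f} GB")
```

The bf16 estimate comes out at roughly 1.5 TB and the 2.56 bpw estimate at roughly 238 GB, in line with the on-disk size reported in the Metrics table.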
Quality
This is the ≤256 GB option — the routed experts are 2-bit, so it is deliberately bit-starved. If you have a 512 GB machine, the 3.5 bpw build is materially better (−32% wikitext PPL, −14% code) and still runs a full 1M context.
The figures above are strided perplexities from a fixed local harness. They are relative numbers for comparing these two builds and are not directly comparable to perplexities that other quantizers report on different corpora.
Benchmarks
Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:
| Metric | GLM-5.1 · 2.7 bpw | GLM-5.2 · 2.56 bpw (this) | GLM-5.2 · 3.5 bpw |
|---|---|---|---|
| Perplexity (lower) | 4.165 | 3.850 | 3.766 |
| HellaSwag (acc_norm) | 0.606 | 0.636 | 0.610 |
| PIQA (acc) | 0.796 | 0.796 | 0.828 |
| WinoGrande (acc) | 0.660 | 0.708 | 0.766 |
| Generation (tok/s) | 18.35 | 22.87 | 21.29 |
Perplexity here is on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default) — a different corpus and method from the wikitext strided figure above, so values are not comparable across the two. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and quantization.
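To get a feel for how much of a gap in the table is sampling noise at a 500-sample limit, the binomial normal approximation gives a rough interval. This is a sketch under that assumption only; the card does not say which interval method produced its ± figure, and the helper name `accuracy_ci_half_width` is illustrative:

```python
import math


def accuracy_ci_half_width(acc: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for an accuracy.

    z=1.96 corresponds to a ~95% interval; z=1.0 gives one standard error.
    """
    return z * math.sqrt(acc * (1.0 - acc) / n)


N_SAMPLES = 500  # sample limit stated in this card

# Scores for this build, copied from the table above.
scores = {"HellaSwag": 0.636, "PIQA": 0.796, "WinoGrande": 0.708}

for task, acc in scores.items():
    half = accuracy_ci_half_width(acc, N_SAMPLES)
    print(f"{task}: {acc:.3f} +/- {half:.3f}")
```

Differences between columns that are smaller than these half-widths, such as the HellaSwag scores of the two GLM-5.2 builds, should not be read as a real ranking.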
Quantization recipe
| Component | Bits | Notes |
|---|---|---|
| Routed experts (gate/up/down) | 2-bit g64 | ~96% of params — the bulk |
| MLA attn · shared experts · dense MLP | 4-bit g64 | per-token critical path |
| Token embedding · LM head | 6-bit g64 | distribution-sensitive |
| Router (mlp.gate) | bf16 | drives discrete top-8 routing |
| DSA lightning indexer | fp16 | drives discrete top-k selection |
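The nominal bit widths in the recipe understate the stored size slightly, because group-wise quantization also keeps per-group metadata. The sketch below assumes affine quantization with one 16-bit scale and one 16-bit bias per group of 64 weights, which is how MLX's default quantization is commonly described. The parameter shares in `hypothetical_mix` are illustrative guesses, not measured values from this build:

```python
def effective_bpw(bits: int, group_size: int,
                  scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight including per-group scale and bias overhead (assumed layout)."""
    return bits + (scale_bits + bias_bits) / group_size


# Per-component effective widths for the g64 rows of the recipe.
for bits in (2, 4, 6):
    print(f"{bits}-bit g64 -> {effective_bpw(bits, 64):.2f} bpw")

# Hypothetical split of parameters across components (assumption for illustration;
# only the ~96% routed-expert share comes from the table above).
hypothetical_mix = [
    (0.960, effective_bpw(2, 64)),  # routed experts
    (0.035, effective_bpw(4, 64)),  # attention, shared experts, dense MLP
    (0.005, effective_bpw(6, 64)),  # embedding and LM head
]
average = sum(share * bpw for share, bpw in hypothetical_mix)
print(f"weighted average: {average:.2f} bpw")
```

With these assumed shares the weighted average lands close to the ~2.56 bpw reported for this build; the exact figure depends on the real per-component parameter counts.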
Correctness (verified vs the HF reference)
GLM-5.2's glm_moe_dsa needed fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:
- IndexShare: the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k (index_topk_freq=4). The stock port built an indexer on every layer, which caused missing-weight errors or wrong output beyond 2,048 tokens.
- Indexer RoPE/eps: the indexer uses non-interleaved (half-split) RoPE and LayerNorm eps 1e-6, distinct from the interleaved main attention. Post-RoPE q matches the HF reference to ~1e-7. Recorded in config.json (indexer_rope_traditional=false, indexer_norm_eps=1e-6).
Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; needle retrieval succeeds through a 7,586-token prompt (sparse-DSA regime); coherent code generation; peak ≤256 GB.
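The difference between the two RoPE layouts mentioned above is only in which elements are paired for rotation. The following pure-Python sketch illustrates the idea on a toy 4-element vector; the base of 10000 and the dimension are illustrative assumptions, not GLM-5.2's actual settings, and the function names are hypothetical:

```python
import math


def _angle(pos: int, i: int, dim: int, base: float) -> float:
    """Rotation angle for frequency index i at position pos."""
    return pos * base ** (-2.0 * i / dim)


def rope_interleaved(x, pos, base=10000.0):
    """Interleaved ("traditional") RoPE: rotates consecutive pairs (x[2i], x[2i+1])."""
    dim = len(x)
    out = list(x)
    for i in range(dim // 2):
        c, s = math.cos(_angle(pos, i, dim, base)), math.sin(_angle(pos, i, dim, base))
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out


def rope_half_split(x, pos, base=10000.0):
    """Half-split (non-interleaved) RoPE: rotates pairs (x[i], x[i + dim/2])."""
    dim = len(x)
    half = dim // 2
    out = list(x)
    for i in range(half):
        c, s = math.cos(_angle(pos, i, dim, base)), math.sin(_angle(pos, i, dim, base))
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(rope_interleaved(x, pos=3))
print(rope_half_split(x, pos=3))
```

The two functions agree at position 0 and diverge afterwards, so applying the wrong layout to the indexer produces a result that is a permuted-input version of the correct one rather than an obviously broken tensor. That is consistent with the symptom described above, where output degrades at longer context instead of failing outright.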
Usage
# requires mlx-lm with the glm_moe_dsa indexer fixes
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw \
--prompt "Write a quicksort in Python."
# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw
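Once the server is running, it can be called from Python with the standard library alone. This is a minimal sketch, assuming the server listens on localhost port 8080 (change `BASE_URL` if you pass a different `--host` or `--port`) and returns the usual OpenAI chat-completions response shape; `build_chat_request` and `chat` are illustrative helper names:

```python
import json
import urllib.request

MODEL_ID = "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
BASE_URL = "http://localhost:8080/v1"  # assumption: server started with default host/port


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL_ID, max_tokens: int = 256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a quicksort in Python."))` sends the same prompt as the command-line example above.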
Hardware
Runs in ≤256 GB unified memory (Apple Silicon). On a 256 GB box the 238 GB of weights leave only ~18 GB for KV + OS (short/mid context); on a 512 GB M3 Ultra there is ample room for a long-context KV cache.
Credits
- Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
- MLX & mlx-lm: Apple ml-explore.
- Mixed-precision quantization + glm_moe_dsa correctness fixes: Alis (avlp12).
Citation
Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-2.56bpw — 2.56 bpw MLX quantization of GLM-5.2 for ≤256 GB Apple Silicon. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw