Libraries

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with MLX:

# Make sure mlx-lm is installed (this model needs a build with the glm_moe_dsa indexer fixes)
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw")

# Wrap the user message in the model's chat template
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Stream the completion to stdout and keep the final text
text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Run Hermes

hermes

MLX LM

How to use avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
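
The same request can also be sent from Python using only the standard library. This is a minimal sketch, assuming the server was started without `--port` and therefore listens on 8080; the helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw"
# Assumption: default mlx_lm.server port (8080) on the local machine.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Nothing is sent until `chat` is called.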

GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Apple Silicon (MLX) mixed-precision quantization of zai-org/GLM-5.2 — a 744B-parameter (~40B active) Mixture-of-Experts model with DeepSeek-V3.2-style MLA + DeepSeek Sparse Attention (DSA, glm_moe_dsa). Quantized to ~2.56 bits/weight so the full model runs in ≤256 GB of unified memory.

⚠️ Requires a patched mlx-lm with the glm_moe_dsa indexer fixes (see Correctness below). The stock port is incomplete for GLM-5.2: with it, loading fails or long-context output degrades.

Metrics


Base model	zai-org/GLM-5.2 (744B total / ~40B active)
Bits/weight	~2.56 (per-tensor mixed)
On-disk size	237.9 GB (46 shards)
Peak memory	~238 GB (short ctx) · ~245 GB (8K ctx)
Format	MLX (Apple Silicon)
Context	up to 1M tokens (DSA sparse attention)

Why this model

GLM-5.2 is a frontier agentic-coding MoE, but at 744B parameters it takes ~1.5 TB in bf16, which is out of reach for consumer memory, and existing MLX builds start at ~360 GB (≥4-bit, 512 GB-class machines). This build uses Unsloth-style per-tensor mixed precision: the routed experts (~97% of params) go to 2-bit while the sensitive paths keep higher precision. The result lands under 256 GB while preserving long-context retrieval and coding quality.
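
The size figures quoted in this card follow from simple arithmetic. Here is a minimal sketch, assuming decimal units (1 GB = 10^9 bytes) and the nominal 744B parameter count; the function names are illustrative.

```python
PARAMS = 744e9            # total parameters stated for GLM-5.2
ON_DISK_BYTES = 237.9e9   # on-disk size of this build (decimal GB assumed)


def size_in_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage needed for n_params weights at the given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


def bits_per_weight(size_bytes: float, n_params: float) -> float:
    """Average bits per weight implied by a file size."""
    return size_bytes * 8 / n_params


bf16_gb = size_in_gb(PARAMS, 16)              # full-precision footprint
bpw = bits_per_weight(ON_DISK_BYTES, PARAMS)  # what the 237.9 GB build implies
print(f"bf16: {bf16_gb:.0f} GB, this build: {bpw:.2f} bits/weight")
```

Under these assumptions the bf16 footprint comes out near 1.5 TB and the on-disk size corresponds to roughly 2.56 bits per weight, in line with the figures in the card.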

Quality

This is the ≤256 GB option: the routed experts are 2-bit, so it is deliberately bit-starved. If you have a 512 GB machine, the 3.5 bpw build is materially better (32% lower wikitext PPL, 14% lower on code) and still runs a full 1M context.

These figures are strided perplexities from a fixed local harness. They are relative numbers for comparing these two builds and are not directly comparable to perplexities that other quantizers report on different corpora.

Benchmarks

Reproduced with mlx_lm.evaluate (0-shot) and mlx_lm.perplexity (seq 2048, 50 samples, seed 123), against the author's earlier GLM-5.1 quant under the same harness and settings:

Metric	GLM-5.1 · 2.7 bpw	GLM-5.2 · 2.56 bpw (this)	GLM-5.2 · 3.5 bpw
Perplexity (lower)	4.165	3.850	3.766
HellaSwag (acc_norm)	0.606	0.636	0.610
PIQA (acc)	0.796	0.796	0.828
WinoGrande (acc)	0.660	0.708	0.766
Generation (tok/s)	18.35	22.87	21.29

Perplexity here is measured on allenai/tulu-3-sft-mixture (the mlx_lm.perplexity default), a different corpus and method from the wikitext strided figures above, so values are not comparable across the two. Task accuracies use a 500-sample limit (CI ±0.02–0.04). GLM-5.1 is a different (older) base model, so cross-generation gaps reflect both the newer model and the quantization.
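
To judge which task-accuracy gaps in the table exceed sampling noise, the binomial standard error is a useful yardstick. This is a rough sketch, not the harness's own error computation: it assumes 500 samples per task and treats the two builds' scores as independent.

```python
import math


def binomial_stderr(accuracy: float, n_samples: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def gap_in_stderrs(acc_a: float, acc_b: float, n_samples: int) -> float:
    """Difference between two accuracies, in units of its combined standard error."""
    combined = math.hypot(
        binomial_stderr(acc_a, n_samples), binomial_stderr(acc_b, n_samples)
    )
    return abs(acc_a - acc_b) / combined


N = 500  # per-task sample limit stated in the card
# HellaSwag: this build (0.636) vs the 3.5 bpw build (0.610)
print(f"HellaSwag gap: {gap_in_stderrs(0.636, 0.610, N):.1f} standard errors")
# WinoGrande: this build (0.708) vs the 3.5 bpw build (0.766)
print(f"WinoGrande gap: {gap_in_stderrs(0.708, 0.766, N):.1f} standard errors")
```

Under these assumptions the HellaSwag gap is below one combined standard error, so the apparent lead of the 2.56 bpw build there is within noise, while the WinoGrande gap is about two standard errors.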

Quantization recipe

Component	Bits	Notes
Routed experts (gate/up/down)	2-bit g64	~97% of params (the bulk)
MLA attn · shared experts · dense MLP	4-bit g64	per-token critical path
Token embedding · LM head	6-bit g64	distribution-sensitive
Router (`mlp.gate`)	bf16	drives discrete top-8 routing
DSA lightning indexer	fp16	drives discrete top-k selection
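
A recipe like the one tabulated above can be written as a name-based rule plus a storage-cost estimate. The following is a sketch under stated assumptions, not the author's conversion script. The substrings `switch_mlp`, `indexer`, `embed_tokens` and `lm_head` are guesses at the parameter naming, and the cost model assumes each 64-weight group stores a 16-bit scale and a 16-bit bias.

```python
from typing import Optional, Tuple

GROUP_SIZE = 64
# Assumption: one fp16 scale and one fp16 bias are stored per quantization group.
OVERHEAD_BITS_PER_GROUP = 32


def choose_precision(name: str) -> Optional[Tuple[int, int]]:
    """Return (bits, group_size) for a module path, or None to leave it unquantized.

    The substrings matched here are hypothetical names, chosen for illustration.
    """
    if "indexer" in name:
        return None  # DSA lightning indexer stays in fp16
    if name.endswith("mlp.gate"):
        return None  # router stays in bf16
    if "switch_mlp" in name:
        return (2, GROUP_SIZE)  # routed experts
    if "embed_tokens" in name or "lm_head" in name:
        return (6, GROUP_SIZE)
    return (4, GROUP_SIZE)  # attention, shared experts, dense MLP


def effective_bits(bits: int, group_size: int = GROUP_SIZE) -> float:
    """Bits per weight once the per-group scale and bias are counted."""
    return bits + OVERHEAD_BITS_PER_GROUP / group_size


# Rough blend: assume ~97% of weights are routed experts (the figure quoted
# under "Why this model") and lump everything else at 4-bit.
expert_fraction = 0.97
average = expert_fraction * effective_bits(2) + (1 - expert_fraction) * effective_bits(4)
print(f"estimated average: {average:.2f} bits/weight")
```

With these assumed fractions the estimate lands near the ~2.56 bits/weight reported for this build. The embedding, LM head, router and indexer are ignored in the blend, so treat the agreement as approximate.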

Correctness (verified vs the HF reference)

GLM-5.2's glm_moe_dsa needed fixes beyond the stock mlx-lm port; this build was produced with a patched fork and validated:

IndexShare: the DSA indexer runs only on "full" layers; "shared" layers reuse its top-k selection (index_topk_freq=4). The stock port built an indexer on every layer, which caused missing-weights errors or wrong output beyond 2048 tokens.
Indexer RoPE/eps: the indexer uses non-interleaved (half-split) RoPE and LayerNorm eps 1e-6, unlike the main attention, which uses interleaved RoPE. Post-RoPE q matches the HF reference to ~1e-7. Both settings are recorded in config.json (indexer_rope_traditional=false, indexer_norm_eps=1e-6).
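
The two RoPE layouts named above differ only in which dimensions are rotated together. Here is a minimal pure-Python sketch for illustration (not the mlx-lm kernel); the base of 10000 and the function names are assumptions. In MLX's naming, `traditional=True` corresponds to the interleaved layout.

```python
import math
from typing import List


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """One rotation angle per dimension pair (dim // 2 pairs)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x: List[float], position: int) -> List[float]:
    """Interleaved layout: rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_half_split(x: List[float], position: int) -> List[float]:
    """Non-interleaved layout: rotate pairs (x[i], x[i + dim/2])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


q = [1.0, 2.0, 3.0, 4.0]
print(rope_interleaved(q, position=3))
print(rope_half_split(q, position=3))
```

Both variants preserve the vector's norm but produce different outputs for the same input. Applying the wrong layout therefore raises no error. It just changes the indexer's scores.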

Validation: full-attention logits match the HF reference to float precision at ≤index_topk context; needle retrieval succeeds through a 7,586-token prompt (sparse-DSA regime); code generation is coherent; peak memory stays ≤256 GB.
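
The IndexShare behaviour and the ≤index_topk equivalence noted above can be sketched in a few lines. This is an illustration under assumptions, not the patched mlx-lm code. It assumes a layer is "full" when its index is a multiple of index_topk_freq, and it takes index_topk = 2048, as suggested by the >2048-token failure mode. The `indexer_scores` callable is a stand-in for the lightning indexer's output.

```python
from typing import Callable, List, Optional

INDEX_TOPK = 2048      # assumed selection budget per query
INDEX_TOPK_FREQ = 4    # from the card: shared layers reuse a full layer's top-k


def select_topk(scores: List[float], k: int = INDEX_TOPK) -> List[int]:
    """Indices of the k highest-scoring tokens, in ascending position order."""
    if len(scores) <= k:
        return list(range(len(scores)))  # nothing is dropped: same as full attention
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def plan_layers(
    num_layers: int,
    indexer_scores: Callable[[int], List[float]],
    k: int = INDEX_TOPK,
    freq: int = INDEX_TOPK_FREQ,
) -> List[List[int]]:
    """Per-layer token selection: 'full' layers run the indexer, others reuse it."""
    selections: List[List[int]] = []
    cached: Optional[List[int]] = None
    for layer in range(num_layers):
        if layer % freq == 0 or cached is None:  # assumed rule for 'full' layers
            cached = select_topk(indexer_scores(layer), k)
        selections.append(cached)
    return selections
```

In this sketch the indexer is evaluated once per group of four layers, and any context no longer than `k` keeps every token. That would explain why logits can match a dense reference exactly in that regime.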

Usage

# requires mlx-lm with the glm_moe_dsa indexer fixes
mlx_lm.generate --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw \
  --prompt "Write a quicksort in Python."

# OpenAI-compatible server
mlx_lm.server --model avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Hardware

Runs in ≤256 GB unified memory (Apple Silicon). On a 256 GB box the 238 GB of weights leave only ~18 GB for KV + OS (short/mid context); on a 512 GB M3 Ultra there is ample room for a long-context KV cache.

Memory headroom: the 238 GB of weights is tight on a 256 GB machine (~18 GB free, short/mid context) but roomy on 512 GB (~274 GB free, 1M context).
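
The headroom figures are the difference between unified memory and the model's footprint. Here is a small sketch, assuming decimal GB and ignoring whatever the OS itself reserves; the function name is illustrative.

```python
WEIGHTS_GB = 238   # approximate weight footprint of this build
PEAK_8K_GB = 245   # reported peak memory at 8K context


def headroom_gb(unified_memory_gb: int, used_gb: int = WEIGHTS_GB) -> int:
    """Memory left over after subtracting the given footprint."""
    return unified_memory_gb - used_gb


for machine_gb in (256, 512):
    after_weights = headroom_gb(machine_gb)
    after_8k_peak = headroom_gb(machine_gb, PEAK_8K_GB)
    print(
        f"{machine_gb} GB machine: ~{after_weights} GB after weights, "
        f"~{after_8k_peak} GB at the 8K-context peak"
    )
```

On a 256 GB machine this leaves only about 11 GB at the reported 8K-context peak. That is why long contexts are better suited to the 512 GB class.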

Credits

Base model: Zhipu / Z.ai — GLM-5.2 (MIT).
MLX & mlx-lm: Apple ml-explore.
Mixed-precision quantization + glm_moe_dsa correctness fixes: Alis (avlp12).

Citation

Alis (avlp12) (2026). GLM-5.2-Alis-MLX-Dynamic-2.56bpw — 2.56 bpw MLX quantization of GLM-5.2 for ≤256 GB Apple Silicon. https://huggingface.co/avlp12/GLM-5.2-Alis-MLX-Dynamic-2.56bpw

Safetensors

Model size	743B params
Tensor type	BF16 · U32 · F32
