Support this work → · X · GitHub · REAP paper · Cerebras REAP

GLM-5.1-555B-GGUF

GGUF quantization of zai-org/GLM-5.1.

At a glance

Base model zai-org/GLM-5.1
Format GGUF
Total params 555B
Active / token 14B
Experts / layer —
Layers —
Hidden size —
Context —
On-disk size 348 GB

Which variant should I pick?

Variant Format Link
GLM-5.1-444B BF16 link
GLM-5.1-444B-GGUF GGUF link
GLM-5.1-478B-NVFP4 NVFP4 link
GLM-5.1-555B BF16 link
GLM-5.1-555B-GGUF (this) GGUF link
GLM-5.1-555B-NVFP4 NVFP4 link
GLM-5.1-555B-W4A16 W4A16 link

This is a Q4_K_M quantized GGUF of the 25% expert-pruned zai-org/GLM-5.1 using REAP (Relative Expert Activation Pruning).

Property Value
Base model zai-org/GLM-5.1 (744B MoE, 256 experts/layer)
Architecture GlmMoeDsaForCausalLM (MoE + Dynamic Sparse Attention)
Routed experts 256 → 192 (25% removed, 64 per layer)
Active params/token ~14B (top-8 routing preserved)
Quantization Q4_K_M with Q8_0 protection for attention, router, shared expert, dense layers
GGUF size 325 GB (single file)
BF16 source 0xSero/GLM-5.1-555B

Benchmark Results (inference mode, temp=0.8)

Suite Metric Result Repetition Loops
Terminal-Bench (50) Proxy Pass 44/50 (88%) 0/50
SWE-bench Pro (50) Proxy Pass 33/50 (66%) 0/50
GSM8K (50) Correct 30/50 (60%) 0/50
HLE (50) Correct 9/50 (18%) 0/50

Zero repetition loops across 220 benchmark probes. This model completely eliminates the repetition degeneration that affected the more aggressively pruned 40% variant.

Degeneration Fuzz Test (45 probes)

Category Result
Code generation (15) 2/15 borderline (btree, sql_schema)
Structured output (4) 1/4 borderline (api_spec)
Reasoning (4) 0/4
Creative writing (4) 0/4
Math (2) 0/2
Domain knowledge (3) 0/3
Patch generation (3) 0/3
Overall 4/45 (8.9%) — all borderline

Why 25% instead of 40%?

The 40% pruned variant (444B, 154 experts/layer) suffered from repetition loops in ~29% of code/structured generation tasks. Root cause analysis showed the degeneration rate is determined by pruning aggressiveness — removing 40% of experts left too few for the model to maintain coherent long-form output. The 25% prune retains 192/256 experts, providing enough expert diversity for stable generation at all sequence lengths.

How to Use

# Requires llama.cpp with CUDA support
llama-server \
  -m glm51-555b-reap-Q4_K_M-protected.gguf \
  -ngl 99 -c 131072 -np 1 --alias glm51-q4 \
  --host 127.0.0.1 --port 8011 \
  --jinja --reasoning on --reasoning-format deepseek

Requires ~80-90 GiB VRAM per GPU across 4 GPUs, or ~325 GiB total.

Quantization Details

Protected at Q8_0 (NOT quantized to Q4):

  • Router gate weights + bias
  • DSA indexer weights
  • All attention projections + norms
  • Shared expert (gate, up, down)
  • Dense layers (first 3 layers)
  • Token embeddings + output head

Quantized to Q4_K / Q6_K:

  • Routed expert projections (gate, up → Q4_K; down → Q6_K)

Related Models

Model Prune % Experts Status
0xSero/GLM-5.1-555B 25% 192/256 BF16 source for this GGUF
0xSero/GLM-5.1-444B 40% 154/256 Has repetition issues — use 25% instead
0xSero/GLM-5.1-444B-GGUF 40% 154/256 BROKEN — repetition loops, deprecated

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Downloads last month
213
GGUF
Model size
563B params
Architecture
glm-dsa
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/GLM-5.1-555B-GGUF

Base model

zai-org/GLM-5.1
Quantized
(41)
this model

Space using 0xSero/GLM-5.1-555B-GGUF 1

Collections including 0xSero/GLM-5.1-555B-GGUF

Paper for 0xSero/GLM-5.1-555B-GGUF