GLM-5.2: ShortGPT-pruned, Mixed-Precision GGUF (IQ2_S experts · IQ4_NL rest)

This is a ShortGPT-pruned re-release of the GLM-5.2 mixed-precision GGUF.

Starting from zai-org/GLM-5.2 (256×22B Mixture-of-Experts, architecture glm-dsa), 12 Transformer blocks were removed by ShortGPT structured layer pruning (block count 79 → 67), and the surviving weights were then re-quantized with llama-quantize's per-tensor mixed-precision workflow using an importance matrix. The MoE expert tensors are stored at IQ2_S, while the dense / attention / norm / embedding / shared-head tensors stay at IQ4_NL; the file as a whole averages ≈2.6 BPW.

The goal is the smallest practical memory footprint: the pruned, low-bit model is ≈191 GiB, roughly 18% smaller than the un-pruned mixed-precision release, while keeping exactly the same per-tensor quantization scheme.
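
As a quick sanity check on these figures, the sketch below back-computes the size of the un-pruned release implied by the 18% reduction and compares it with the fraction of blocks removed. The input numbers come from this card; the function name is illustrative only.

```python
def implied_unpruned_size(pruned_gib: float, reduction: float) -> float:
    """Size before pruning, given the pruned size and the fractional reduction."""
    return pruned_gib / (1.0 - reduction)

pruned_gib = 191.0                    # approximate size stated in this card
reduction = 0.18                      # "roughly 18% smaller"
blocks_before, blocks_after = 79, 67  # block count before / after ShortGPT

unpruned_gib = implied_unpruned_size(pruned_gib, reduction)
blocks_removed_frac = (blocks_before - blocks_after) / blocks_before

print(f"implied un-pruned size: {unpruned_gib:.0f} GiB")
print(f"blocks removed: {blocks_removed_frac:.1%}")
```

Under these inputs the un-pruned release would be about 233 GiB, and the 12 dropped blocks are about 15.2% of the original 79.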

Model particulars (from GGUF KV metadata)

| Key | Value |
|---|---|
| Architecture | glm-dsa |
| Name / version | GLM-5.2 / 5.2 |
| Size label | 256x22B (256 experts, 8 active, 1 shared) |
| Block count | 67 (was 79 before ShortGPT pruning; 12 blocks dropped) |
| Leading dense blocks | 3 |
| Context length | 1,048,576 (1M tokens) |
| Embedding length | 6144 |
| Feed-forward length (dense) | 12288 |
| Expert FF length | 2048 |
| Attention heads / KV heads | 64 / 1 (MLA, q_lora_rank=2048, kv_lora_rank=512, key_length_mla=256, value_length_mla=256) |
| RoPE base / dim | 8,000,000 / 64 |
| Vocabulary | 154,880 (tokenizer glm4 / gpt2) |
| Expert gating | func=2, weights_scale=2.5, weights_norm=true |
| NextN predict layers | 1 |
| License | MIT |
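
The MLA fields above allow a rough estimate of the attention cache at a given context length. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes the runtime caches one compressed latent (kv_lora_rank values) plus one RoPE key (RoPE dim values) per token per block, stored at 16 bits, and it ignores any extra cache used by the indexer or the NextN layer. Whether a given llama.cpp build stores the cache this way is not confirmed by this card, and the function name is hypothetical.

```python
def mla_cache_bytes(n_tokens: int, n_blocks: int = 67,
                    kv_lora_rank: int = 512, rope_dim: int = 64,
                    bytes_per_value: int = 2) -> int:
    """Rough MLA cache size: one latent plus one RoPE key per token per block.

    Assumption: the cache holds the compressed latent rather than full K/V.
    """
    per_token_per_block = (kv_lora_rank + rope_dim) * bytes_per_value
    return n_tokens * n_blocks * per_token_per_block

MIB = 1024 ** 2
GIB = 1024 ** 3

print(f"{mla_cache_bytes(8192) / MIB:.0f} MiB at 8,192 tokens")
print(f"{mla_cache_bytes(1_048_576) / GIB:.1f} GiB at the full 1M context")
```

Under these assumptions the cache is about 603 MiB at 8,192 tokens and about 75.4 GiB at the full context, which is why a small -c value matters on memory-constrained hosts.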

Pruning

ShortGPT scores each decoder block by how much it changes the hidden state (one minus the cosine similarity between the block's input and output) and drops the lowest-scoring blocks. On GLM-5.2 this removed 12 blocks (79 → 67), reducing both parameter count and activation memory. Layer indices are sparse afterwards: the retained blocks keep their original indices rather than being renumbered, so tensor names still run up to blk.78 even though the file reports block_count=67.
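
The scoring step can be illustrated with a minimal, pure-Python sketch. It is not the script used for this release: the hidden states are made-up toy vectors, and the function names are hypothetical. It computes one minus the mean cosine similarity between each block's input and output, then selects the lowest-scoring blocks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def block_influence(inputs, outputs):
    """1 - mean cosine similarity between a block's input and output states."""
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)

def blocks_to_drop(hidden_states, n_drop):
    """hidden_states[i] holds the per-token vectors entering block i;
    the final entry is the output of the last block."""
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_drop]), scores

# Toy example: 3 blocks, 2 tokens, 2-dimensional hidden states.
states = [
    [[1.0, 0.0], [0.0, 1.0]],    # input to block 0
    [[1.0, 1.0], [-1.0, 1.0]],   # output of block 0: rotated 45 degrees
    [[2.0, 2.0], [-2.0, 2.0]],   # output of block 1: rescaled, same direction
    [[2.0, -2.0], [2.0, 2.0]],   # output of block 2: rotated 90 degrees
]
dropped, scores = blocks_to_drop(states, n_drop=1)
kept = [i for i in range(len(scores)) if i not in dropped]
print("dropped:", dropped, "kept:", kept)
```

Block 1 only rescales its input, so it scores lowest and is dropped. The kept list retains the original indices 0 and 2, which mirrors the sparse layer numbering described above.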

Quantization mapping

Per-tensor type assignment passed to llama-quantize (same scheme as the sibling un-pruned release):

| Tensor pattern | Quant |
|---|---|
| ffn_gate_exps (most blocks) | IQ2_S |
| ffn_up_exps (most blocks) | IQ2_S |
| ffn_down_exps (most blocks) | IQ2_S |
| blk.78.ffn_*_exps (last MoE block, no separate weights) | IQ4_NL |
| everything else (attention, norms, embeddings, shared head, indexer) | IQ4_NL |
  • Source GGUF: unsloth/GLM-5.2-GGUF IQ4_NL variant, pruned and re-quantized with --allow-requantize and --keep-split.
  • Importance matrix: imatrix_unsloth.gguf (sourced from Unsloth).
  • Final size: ≈191 GiB across 9 shards, ≈2.6 BPW.
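
The mapping in the table can be expressed as a small lookup function. This is an illustrative sketch rather than the rules actually passed to llama-quantize. It assumes tensor names follow the usual llama.cpp form blk.N.ffn_{gate,up,down}_exps.weight, and it treats block 78 as the only expert block kept at IQ4_NL, although the table only says "most blocks". The example tensor names and the iq4_expert_blocks parameter are hypothetical.

```python
import re

# Matches routed-expert FFN tensors, e.g. "blk.10.ffn_down_exps.weight".
EXPERT_TENSOR = re.compile(r"^blk\.(\d+)\.ffn_(gate|up|down)_exps\.weight$")

def quant_for(tensor_name: str, iq4_expert_blocks=frozenset({78})) -> str:
    """Return the quant type this card's table assigns to a tensor name.

    Assumption: block 78 is the only expert block kept at IQ4_NL.
    """
    m = EXPERT_TENSOR.match(tensor_name)
    if m and int(m.group(1)) not in iq4_expert_blocks:
        return "IQ2_S"
    return "IQ4_NL"

for name in ("blk.10.ffn_down_exps.weight",
             "blk.78.ffn_up_exps.weight",
             "blk.10.ffn_gate_inp.weight",
             "token_embd.weight"):
    print(f"{name:32s} {quant_for(name)}")
```

Only the first name resolves to IQ2_S; the block-78 expert tensor, the router tensor and the embedding all fall through to IQ4_NL.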

Files

Filenames include the IQ2_S / IQ4_NL quant tokens so Hugging Face's quantization-variant scanner recognizes the shards (a single quant label is not possible for a mixed-precision quant; both constituent quants are listed).

| File | Approx. size |
|---|---|
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf | 9.0 MiB (headers/tokenizer) |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00002-of-00009.gguf | 20.9 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00003-of-00009.gguf | 31.0 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00004-of-00009.gguf | 31.1 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00005-of-00009.gguf | 31.0 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00006-of-00009.gguf | 22.9 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00007-of-00009.gguf | 18.0 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00008-of-00009.gguf | 25.1 GiB |
| GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00009-of-00009.gguf | 10.9 GiB |
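
The shard sizes can be cross-checked against the headline figures with a few lines of arithmetic. The sizes are the approximate values listed for the nine shards. The parameter count of 625B is an assumption taken from the size shown on the model page, not from the GGUF metadata listed in this card.

```python
# Approximate shard sizes in GiB, in shard order (first shard is 9.0 MiB).
shard_sizes_gib = [9.0 / 1024, 20.9, 31.0, 31.1, 31.0, 22.9, 18.0, 25.1, 10.9]

total_gib = sum(shard_sizes_gib)

# Assumption: roughly 625B parameters, as shown on the model page.
n_params = 625e9
bits_per_weight = total_gib * 1024 ** 3 * 8 / n_params

print(f"total size: {total_gib:.1f} GiB")
print(f"bits per weight: {bits_per_weight:.2f}")
```

The total comes to about 190.9 GiB and about 2.62 bits per weight, consistent with the ≈191 GiB and ≈2.6 BPW quoted for this release.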

Usage

Load with any recent llama.cpp build (or a compatible runner such as LM Studio, Ollama, or koboldcpp) that supports the glm-dsa architecture, MLA attention and IQ2_S / IQ4_NL dequantization. GPU offload is strongly recommended.

llama-server \
  -m GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 999 -c 8192

The first shard is the entry point; llama.cpp follows the split-file links to load all 9 shards automatically. Point -m at 00001-of-00009.
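
Because loading depends on every shard sitting next to the first one, it can help to verify the download before starting the server. The sketch below derives the expected shard names from the first shard's filename and reports any that are missing. It relies only on the NNNNN-of-NNNNN naming visible in the file list; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Split-GGUF naming as used by the files in this repository.
SPLIT_RE = re.compile(r"^(?P<prefix>.+)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.gguf$")

def expected_shards(first_shard: str) -> list[str]:
    """List every shard filename implied by one split-GGUF filename."""
    m = SPLIT_RE.match(first_shard)
    if m is None:
        raise ValueError(f"not a split GGUF filename: {first_shard}")
    total = int(m["total"])
    return [f"{m['prefix']}-{i:05d}-of-{total:05d}.gguf"
            for i in range(1, total + 1)]

def missing_shards(first_shard_path: str) -> list[str]:
    """Return expected shard names that are absent from the same directory."""
    path = Path(first_shard_path)
    return [name for name in expected_shards(path.name)
            if not (path.parent / name).is_file()]

names = expected_shards("GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf")
print(len(names), "shards expected, last:", names[-1])
```

Calling missing_shards with the path given to -m returns an empty list when all nine files are present.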

Provenance

  • Base model: zai-org/GLM-5.2 (MIT license).
  • Source GGUF quantization: Unsloth (general.quantized_by = Unsloth, general.repo_url = https://huggingface.co/unsloth).
  • ShortGPT pruning + mixed-precision re-quant with imatrix: Deviad (2026-06-21), on Apple M3 Ultra (Metal build of llama.cpp).

Disclaimer

This is an aggressive low-bit quantization of an already-pruned model, intended to fit a very large MoE into constrained memory. Expect measurable quality degradation versus the source, both from ShortGPT layer removal and from the IQ2_S expert tensors. Validate on your own tasks before relying on it.
