Instructions to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest", filename="GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: llama cli -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: llama cli -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: ./llama-cli -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL # Run inference directly in the terminal: ./build/bin/llama-cli -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Use Docker
docker model run hf.co/Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
- LM Studio
- Jan
- Ollama
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Ollama:
ollama run hf.co/Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
- Unsloth Studio
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest to start chatting
- Pi
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Docker Model Runner:
docker model run hf.co/Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
- Lemonade
How to use Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest:IQ4_NL
Run and chat with the model
lemonade run user.GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest-IQ4_NL
List all available models
lemonade list
GLM-5.2 โ ShortGPT-pruned, Mixed-Precision GGUF (IQ2_S experts ยท IQ4_NL rest)
This is a ShortGPT-pruned re-release of the GLM-5.2 mixed-precision GGUF.
Starting from zai-org/GLM-5.2 (256ร22B Mixture-of-Experts, architecture
glm-dsa), 12 Transformer blocks were removed by ShortGPT structured layer
pruning (block count 79 โ 67), and the surviving weights were then
re-quantized with llama-quantize's per-tensor mixed-precision workflow using an
importance matrix. The MoE expert tensors are stored at IQ2_S (โ2.6 BPW
overall) while the dense / attention / norm / embedding / shared-head tensors
stay at IQ4_NL.
The goal is the smallest practical memory footprint: the pruned + low-bit model is โ191 GiB, roughly 18 % smaller than the un-pruned mixed-precision release, while keeping the same exact quantization scheme per tensor.
Model particulars (from GGUF KV metadata)
| Key | Value |
|---|---|
| Architecture | glm-dsa |
| Name / version | GLM-5.2 / 5.2 |
| Size label | 256x22B (256 experts, 8 active, 1 shared) |
| Block count | 67 (was 79 before ShortGPT pruning; 12 blocks dropped) |
| Leading dense blocks | 3 |
| Context length | 1,048,576 (1M tokens) |
| Embedding length | 6144 |
| Feed-forward length (dense) | 12288 |
| Expert FF length | 2048 |
| Attention heads / KV heads | 64 / 1 (MLA, q_lora_rank=2048, kv_lora_rank=512, key_length_mla=256, value_length_mla=256) |
| RoPE base / dim | 8,000,000 / 64 |
| Vocabulary | 154,880 (tokenizer glm4 / gpt2) |
| Expert gating | func=2, weights_scale=2.5, weights_norm=true |
| NextN predict layers | 1 |
| License | MIT |
Pruning
ShortGPT evaluates the importance of each decoder block (via cosine similarity of
inputs/outputs) and drops the lowest-importance blocks. On GLM-5.2 this removed
12 blocks (79 โ 67), reducing both parameter count and activation memory. Layer
indices are sparse afterwards โ the retained blocks keep their original indices
rather than being renumbered, so the file reports block_count=67.
Quantization mapping
Per-tensor type assignment passed to llama_quantize (same scheme as the
sibling un-pruned release):
| Tensor pattern | Quant |
|---|---|
ffn_gate_exps (most blocks) |
IQ2_S |
ffn_up_exps (most blocks) |
IQ2_S |
ffn_down_exps (most blocks) |
IQ2_S |
blk.78.ffn_*_exps (last MoE block, no separate weights) |
IQ4_NL |
| everything else (attention, norms, embeddings, shared head, indexer) | IQ4_NL |
- Source GGUF:
unsloth/GLM-5.2-GGUFIQ4_NL variant, pruned and re-quantized withallow-requantize+keep-split. - Importance matrix:
imatrix_unsloth.gguf(sourced from Unsloth). - Final size: โ191 GiB across 9 shards, โ2.6 BPW.
Files
Filenames include the IQ2_S / IQ4_NL quant tokens so Hugging Face's
quantization-variant scanner recognizes the shards (a single quant label is not
possible for a mixed-precision quant; both constituent quants are listed).
| File | Approx. size |
|---|---|
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf |
9.0 MiB (headers/tokenizer) |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00002-of-00009.gguf |
20.9 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00003-of-00009.gguf |
31.0 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00004-of-00009.gguf |
31.1 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00005-of-00009.gguf |
31.0 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00006-of-00009.gguf |
22.9 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00007-of-00009.gguf |
18.0 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00008-of-00009.gguf |
25.1 GiB |
GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00009-of-00009.gguf |
10.9 GiB |
Usage
Load with any recent llama.cpp build (and compatible runners โ LM Studio, Ollama,
koboldcpp, etc.) that supports the glm-dsa architecture, MLA attention and IQ2_S /
IQ4_NL dequantization (GPU offload strongly recommended).
llama-server \
-m GLM-5.2-shortgpt-pruned-IQ2_S-IQ4_NL-00001-of-00009.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 -c 8192
The first shard is the entry point;
llama.cppfollows the split-file links to load all 9 shards automatically. Point-mat00001-of-00009.
Provenance
- Base model: zai-org/GLM-5.2 โ MIT.
- Source GGUF quantization: Unsloth (
general.quantized_by = Unsloth,general.repo_url = https://huggingface.co/unsloth). - ShortGPT pruning + mixed-precision re-quant with imatrix: Deviad (2026-06-21),
on Apple M3 Ultra (Metal build of
llama.cpp).
Disclaimer
This is an aggressive low-bit quantization of an already-pruned model, intended to fit a very large MoE into constrained memory. Expect measurable quality degradation versus the source, both from ShortGPT layer removal and from the IQ2_S expert tensors. Validate on your own tasks before relying on it.
- Downloads last month
- 33
4-bit
Model tree for Deviad/GLM-5.2-shortgpt-pruned-IQ2S-experts-IQ4NL-rest
Base model
zai-org/GLM-5.2