GLM-5.2-NVFP4-REAP-Recall (N=172)
A recall-recovered re-REAP of GLM-5.2-NVFP4 — 172 experts/layer (from 256), self-consistent, BF16 weights preserved, NVFP4 experts sliced byte-for-byte.
- Recipe + serving stack: github.com/brandonmmusic-max/GLM-5.2-Reap
- Docker image: `verdictai/glm52-nvfp4-dcpmtp:v3.3`
- Saliency methodology + corpus ledger: `REAP_RECALL_VERDICT.md`
- Verified launch script: `serve_glm52_reap_recall.sh`
Attribution: GLM-5.2 by z.ai · NVFP4 quantization by Luke Alonso · REAP by Cerebras Research (arXiv:2510.13999, ICLR 2026).
Why this exists
Standard REAP compresses GLM-5.2 well for code, agentic, and reasoning workloads — but closed-book factual recall collapses: the published narrow-calibrated REAPs answer Lexington for the capital of Kentucky and loop on Marbury v. Madison. The cause is not the model and not the serving stack — it's a calibration corpus that excludes knowledge. This checkpoint re-runs the REAP saliency on a knowledge- and legal-inclusive, axis-balanced calibration.
| Prompt | Narrow-REAP baseline | This checkpoint (N=172) |
|---|---|---|
| What is the capital of Kentucky? | Lexington | Frankfort |
| In one sentence, what did Marbury v. Madison establish? | (empty / repetition loop) | judicial review — the Supreme Court's authority to declare laws unconstitutional |
| What is the capital of Texas? | — | Austin |
All reasoning prompts still pass (8/8 traps; pipes/syllogism/discount/anticipatory-repudiation all finish=stop).
What I did specifically
- Diagnosed it as calibration, not the model — A/B'd kernels, MTP, DCP, image, sampling. Same prompts, same parser, same image: behavior tracked the calibration corpus, not anything else.
- Built a 4-axis balanced calibration (12,228 samples, 3,057 per axis, max 16,384 tokens, no truncation, no packing):
  - Axis 1 (General knowledge): C4, Wikipedia, MMLU-aux, TriviaQA, Natural Questions
  - Axis 2 (Legal): 1,528 CAP markdown cases + a live Neo4j legal KG (300 headnotes, 390 statutes, 373 case summaries, 113 fact-atoms, 353 worked-examples). `fallback_used: false`.
  - Axis 3 (Code/agentic): evol-codealpaca, Magicoder, xLAM, SWE-smith
  - Axis 4 (Reasoning/termination): terminating `<think>…</think>` traces, `</think>`-region weighted ×6
- Built a real block-wise NVFP4 saliency runner: the published REAP loader can't consume `glm_moe_dsa` + modelopt NVFP4, and a whole-model vLLM load OOMs the intact 435 GB before any forward. The runner chunks decoder layers into VRAM, dequants NVFP4 → BF16 in place, runs the `GlmMoeDsaNaiveMoe` modules (explicit per-expert outputs, the ideal saliency hook), captures `S_j = mean_{active tokens}(router_gate_j · ||expert_output_j||₂)`, and frees the chunk. Real GPU saliency over 7,368,253 active tokens across 75 MoE layers, no static proxy.
- Self-consistent prune at prune time: kept experts renumbered contiguous 0…171, router shrunk to `[172, 6144]`, bias to `[172]`, `n_routed_experts = num_experts = 172`. Loads clean on stock vLLM with no `repair_reap.py`.
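The saliency score and the renumbering step described above can be sketched in a few lines of plain Python. This is an illustration only, not the actual runner: the function names, the list-based data layout, and the rule of keeping the top-scoring experts per layer are assumptions made for the example, and the real pipeline works on GPU tensors chunk by chunk.

```python
from math import sqrt


def l2_norm(vec):
    """Euclidean norm of one expert output vector."""
    return sqrt(sum(v * v for v in vec))


def reap_saliency(records, num_experts):
    """Compute S_j = mean over active tokens of gate_j * ||expert_output_j||_2.

    `records` (hypothetical layout) holds one tuple per (token, expert) pair
    in which the expert was routed to: (expert_id, router_gate, output_vector).
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate, output in records:
        totals[expert_id] += gate * l2_norm(output)
        counts[expert_id] += 1
    # Assumption: an expert that never fired gets a saliency of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def prune_layer(saliency, router_weight, router_bias, keep):
    """Keep the `keep` highest-saliency experts and renumber them 0..keep-1.

    The router rows and bias entries are sliced with the same index list,
    so the pruned layer stays self-consistent.
    """
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original relative order
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_weight = [router_weight[j] for j in kept]
    new_bias = [router_bias[j] for j in kept]
    return old_to_new, new_weight, new_bias


# Toy layer: 4 experts, keep 2 (the checkpoint keeps 172 of 256).
records = [
    (0, 0.5, [3.0, 4.0]),
    (0, 0.1, [0.0, 5.0]),
    (2, 1.0, [6.0, 8.0]),
    (3, 0.2, [3.0, 4.0]),
]
scores = reap_saliency(records, num_experts=4)
mapping, weight, bias = prune_layer(
    scores,
    router_weight=[[0.0], [1.0], [2.0], [3.0]],
    router_bias=[0.0, 0.1, 0.2, 0.3],
    keep=2,
)
print(scores, mapping)
```

Slicing the router weight and bias together with the expert list is what lets the pruned checkpoint load without a separate repair pass: every index the router can emit refers to an expert that still exists.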
Quick start
# Pull serving image
docker pull verdictai/glm52-nvfp4-dcpmtp:v3.3
# Download model (294 GB)
huggingface-cli download brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 \
--local-dir $HOME/models/GLM-5.2-NVFP4-REAP-Recall-N172
# Serve on 4x RTX PRO 6000 96GB (sm120)
docker run -d --name glm52-reap-recall \
--gpus all --runtime nvidia --ipc host --shm-size 32g --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v "$HOME/models":/models-archive:ro -v "$HOME/.cache/glm52-b12x":/cache \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 -e CUDA_DEVICE_ORDER=PCI_BUS_ID -e CUTE_DSL_ARCH=sm_120a \
-e HF_HUB_OFFLINE=1 -e NCCL_IB_DISABLE=1 -e NCCL_P2P_LEVEL=SYS -e NCCL_PROTO=LL,LL128,Simple \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_USE_AOT_COMPILE=1 -e VLLM_USE_BREAKABLE_CUDAGRAPH=0 -e VLLM_USE_FLASHINFER_SAMPLER=1 \
-e B12X_MHC_MAX_TOKENS=16384 -e VLLM_USE_B12X_WO_PROJECTION=1 -e VLLM_USE_B12X_MHC=1 \
-e VLLM_USE_B12X_FP8_GEMM=1 -e VLLM_USE_B12X_MOE=1 -e VLLM_USE_B12X_SPARSE_INDEXER=1 \
-e VLLM_USE_V2_MODEL_RUNNER=1 -e VLLM_USE_FUSED_MOE_GROUPED_TOPK=1 \
-e VLLM_PCIE_ALLREDUCE_BACKEND=b12x -e VLLM_ENABLE_PCIE_ALLREDUCE=1 \
-e B12X_MLA_SM120_UNIFIED=1 -e USES_B12X=True -e B12X_DENSE_SPLITK_TURBO=1 -e B12X_W4A16_TC_DECODE=1 \
-e B12X_MOE_FORCE_A16=1 -e VLLM_DCP_GLOBAL_TOPK=1 -e VLLM_DCP_SHARD_DRAFT=1 \
verdictai/glm52-nvfp4-dcpmtp:v3.3 \
python -m vllm.entrypoints.cli.main serve /models-archive/GLM-5.2-NVFP4-REAP-Recall-N172 \
--served-model-name glm-5.2-nvfp4 --host 0.0.0.0 --port 9405 \
--kv-cache-dtype fp8 --block-size 256 --load-format safetensors \
--tensor-parallel-size 4 --decode-context-parallel-size 4 --moe-backend b12x --linear-backend auto \
--gpu-memory-utilization 0.92 --max-model-len 200000 --max-num-seqs 16 \
--enable-chunked-prefill --enable-prefix-caching --max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 64 --attention-backend B12X_MLA_SPARSE \
--compilation-config '{"custom_ops":["all"],"cudagraph_mode":"PIECEWISE"}' \
--enable-flashinfer-autotune \
--hf-overrides '{"use_index_cache":true,"index_topk_pattern":"FFFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSS"}' \
--reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
--speculative-config '{"method":"mtp","num_speculative_tokens":5,"draft_sample_method":"probabilistic","moe_backend":"b12x","use_local_argmax_reduction":true}'
# Sanity check
curl -s http://127.0.0.1:9405/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"glm-5.2-nvfp4","messages":[{"role":"user","content":"what is the capital of kentucky?"}]}'
Optional: `thinking_token_budget` (hard-caps the reasoning loop). Mount the four V2 patch files from the repo into the container and add `--reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'`, then pass `"thinking_token_budget": N` in the request body. Without those patches, leave `--reasoning-config` off: on a plain image it clobbers `glm45`'s `<think>` priming and the chat path stops thinking.
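As a rough illustration of the request side of this option, the snippet below adds a reasoning cap to a chat-completions body. The helper name `with_thinking_budget` and the value 512 are made up for the example, and placing the field at the top level of the body is an assumption based on the description above. It only has an effect on a server started with the patched image and `--reasoning-config`.

```python
import json


def with_thinking_budget(body, budget):
    """Return a copy of a chat-completions body with a reasoning-token cap.

    Hypothetical helper: the field name comes from the model card, the
    top-level placement is an assumption.
    """
    if budget <= 0:
        raise ValueError("budget must be a positive token count")
    capped = dict(body)  # shallow copy so the caller's dict is not mutated
    capped["thinking_token_budget"] = budget
    return capped


base = {
    "model": "glm-5.2-nvfp4",
    "messages": [{"role": "user", "content": "what is the capital of kentucky?"}],
}
capped = with_thinking_budget(base, 512)  # 512 is an arbitrary example value
print(json.dumps(capped, indent=2))
```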
Verified serving config (4× RTX PRO 6000 96GB, TP4)
The v3.3 serving config (image verdictai/glm52-nvfp4-dcpmtp:v3.3):
| Knob | Value | Notes |
|---|---|---|
| Tensor parallel | 4 | 4× 96 GB |
| Decode context parallel | 4 | the >300K KV path on TP4 |
| GPU memory util | 0.92 | DCP4 + MTP headroom (GPU1 also carries ~6.7 GB display) |
| Max model len | 200,000 | MTP + DCP4 on fp8; raise with no-MTP |
| Max num batched tokens | 8192 | vLLM warns 2048 is suboptimal; 8192 fits the KV pool |
| KV cache dtype | fp8 | fp8 MLA KV on the b12x sparse path |
| MTP num_speculative_tokens | 5 | GLM-5.2 was trained for 5-token MTP (official recipe) |
| `B12X_MOE_FORCE_A16=1` | required | w4a4 accumulates error past ~1–2K gen tokens |
| `VLLM_DCP_GLOBAL_TOPK=1` | required | remaps each shard's local top-k to true global selection (DCP > 1) |
| `VLLM_DCP_SHARD_DRAFT=1` | recommended | shards the MTP draft KV across DCP ranks instead of replicating |
| `-cc.cudagraph_mode=PIECEWISE` | required for long context | CuTe-DSL JIT inside FULL cudagraph capture deadlocks at first decode after long prefill (>100K). PIECEWISE breaks the graph at the indexer so JIT happens outside capture. Must use the CLI shortcut form (JSON drops to None). |
| Reasoning parser | `glm45` | leave `--reasoning-config` off unless using `thinking_token_budget` (see above) |
Benchmarks (measured on 4× RTX PRO 6000 96GB)
Note: the numbers below were measured on the prior `nvfp4_ds_mla` (4-bit MLA KV) config. A `v3.3` (fp8) re-bench is pending.
Benchmarked with local-inference-lab/llm-inference-bench, concurrency × context grid. 15/15 cells completed.
Decode throughput (tok/s)
| ctx | C=1 | C=2 | C=4 | agg @ C=4 |
|---|---|---|---|---|
| 0 | 80.7 | 45.4 | 40.4 | 161.8 |
| 16K | 65.6 | 46.3 | 38.2 | 152.7 |
| 32K | 52.1 | 43.2 | 34.2 | 136.7 |
| 64K | 50.9 | 36.9 | 34.1 | 136.2 |
| 128K | 82.6 | 63.0 | 54.7 | 218.6 |
Real-prompt single-user (Marbury essay × 5, thinking OFF)
| run | tokens | TTFT | decode tok/s |
|---|---|---|---|
| 1 | 1500 | 0.13s | 58.94 |
| 2 | 1500 | 1.06s | 59.93 |
| 3 | 1500 | 0.14s | 57.52 |
| 4 | 1500 | 0.15s | 57.17 |
| 5 | 1500 | 0.15s | 58.70 |
| avg | 1500 | 0.33s | 58.45 |
Prefill (full ingest)
| ctx | tokens | TTFT | tok/s |
|---|---|---|---|
| 8K | 8,199 | 4.66s | 1,761 |
| 16K | 16,228 | 10.87s | 1,493 |
| 32K | 32,321 | 20.43s | 1,582 |
| 64K | 64,513 | 42.27s | 1,526 |
| 128K | 128,887 | 87.26s | 1,477 |
KV pool: 542,857 tokens · VRAM peak: 97.93%
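The prefill tok/s column appears to be the ingested token count divided by TTFT. That reading is an assumption, not something the benchmark tool's output confirms here, but the sketch below shows the table rows agree with it to within rounding of the reported TTFT values.

```python
# Rows copied from the prefill table: (ctx label, tokens, TTFT seconds, reported tok/s).
PREFILL_RUNS = [
    ("8K", 8199, 4.66, 1761),
    ("16K", 16228, 10.87, 1493),
    ("32K", 32321, 20.43, 1582),
    ("64K", 64513, 42.27, 1526),
    ("128K", 128887, 87.26, 1477),
]


def prefill_tok_per_s(tokens, ttft_s):
    """Prefill throughput under the assumption tok/s = tokens / TTFT."""
    return tokens / ttft_s


def relative_error(computed, reported):
    """Relative gap between the recomputed and the reported value."""
    return abs(computed - reported) / reported


for label, tokens, ttft_s, reported in PREFILL_RUNS:
    computed = prefill_tok_per_s(tokens, ttft_s)
    print(f"{label:>4}: computed {computed:8.1f} tok/s, reported {reported}")
```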
Files
- 87 safetensors shards (294 GB), `config.json`, `model.safetensors.index.json`, tokenizer files, `generation_config.json`, `chat_template.jinja`
- `reap_recall_keep_map_with_scores.json`: per-expert real saliency scores and the kept-expert map (the replication artifact)
- `REAP_RECALL_VERDICT.md`: full corpus / saliency / validation ledger
- `serve_glm52_reap_recall.sh`: verified launch script
Sampling
{
"temperature": 1.0,
"top_p": 0.95,
"repetition_penalty": 1.05,
"stop_token_ids": [154820, 154827, 154829],
"chat_template_kwargs": {"enable_thinking": true, "reasoning_effort": "high"}
}
For non-thinking generation, set enable_thinking: false. For Marbury-style essays, max_tokens >= 1500 recommended (thinking-high needs room).
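A minimal client sketch that applies these sampling settings against the server from the quick start, using only the Python standard library. The helper names `build_body` and `chat` are hypothetical, and merging the sampling fields into the top level of the chat-completions body is an assumption about how the server accepts them. The URL and served model name are the ones used in the quick start.

```python
import json
import urllib.request

# Sampling settings copied from the model card.
SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "repetition_penalty": 1.05,
    "stop_token_ids": [154820, 154827, 154829],
    "chat_template_kwargs": {"enable_thinking": True, "reasoning_effort": "high"},
}


def build_body(prompt, thinking=True, max_tokens=1500):
    """Build a chat-completions request body with the recommended sampling."""
    body = {
        "model": "glm-5.2-nvfp4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    body.update(SAMPLING)
    # Copy the nested dict so toggling thinking never mutates SAMPLING.
    body["chat_template_kwargs"] = dict(
        SAMPLING["chat_template_kwargs"], enable_thinking=thinking
    )
    return body


def chat(prompt, url="http://127.0.0.1:9405/v1/chat/completions", **kwargs):
    """POST the request and return the assistant message text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_body(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

With the container running, `chat("what is the capital of kentucky?", thinking=False)` sends a non-thinking request. The default `max_tokens=1500` follows the essay recommendation above.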
License
MIT. Attribution to the sources is the only ask:
- z.ai (GLM-5.2 base)
- Luke Alonso (NVFP4 quantization of GLM-5.2)
- Cerebras Research (REAP method, arXiv:2510.13999)
@inproceedings{lasby2026reap,
title={{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=ukGxWd2aDG}
}
Model tree for brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172
Base model
zai-org/GLM-5.2