GLM-5.2-NVFP4-REAP-Recall (N=172)

A recall-recovered re-REAP of GLM-5.2-NVFP4 — 172 experts/layer (from 256), self-consistent, BF16 weights preserved, NVFP4 experts sliced byte-for-byte.

Attribution: GLM-5.2 by z.ai · NVFP4 quantization by Luke Alonso · REAP by Cerebras Research (arXiv:2510.13999, ICLR 2026).


Why this exists

Standard REAP compresses GLM-5.2 well for code, agentic, and reasoning workloads — but closed-book factual recall collapses: the published narrow-calibrated REAPs answer Lexington for the capital of Kentucky and loop on Marbury v. Madison. The cause is not the model and not the serving stack — it's a calibration corpus that excludes knowledge. This checkpoint re-runs the REAP saliency on a knowledge- and legal-inclusive, axis-balanced calibration.

| Prompt | Narrow-REAP baseline | This checkpoint (N=172) |
|---|---|---|
| What is the capital of Kentucky? | Lexington | Frankfort |
| In one sentence, what did Marbury v. Madison establish? | (empty / repetition loop) | judicial review: the Supreme Court's authority to declare laws unconstitutional |
| What is the capital of Texas? | | Austin |

All reasoning prompts still pass (8/8 traps; pipes/syllogism/discount/anticipatory-repudiation all finish=stop).


What I did specifically

  1. Diagnosed it as calibration, not the model — A/B'd kernels, MTP, DCP, image, sampling. Same prompts, same parser, same image: behavior tracked the calibration corpus, not anything else.
  2. Built a 4-axis balanced calibration (12,228 samples, 3,057 per axis, max 16,384 tokens, no truncation, no packing):
    • Axis 1 — General knowledge: C4, Wikipedia, MMLU-aux, TriviaQA, Natural Questions
    • Axis 2 — Legal: 1,528 CAP markdown cases + a live Neo4j legal KG (300 headnotes, 390 statutes, 373 case summaries, 113 fact-atoms, 353 worked-examples). fallback_used: false.
    • Axis 3 — Code/agentic: evol-codealpaca, Magicoder, xLAM, SWE-smith
    • Axis 4 — Reasoning/termination: terminating <think>…</think> traces, </think>-region weighted ×6
  3. Built a real block-wise NVFP4 saliency runner — the published REAP loader can't consume glm_moe_dsa + modelopt NVFP4, and a whole-model vLLM load OOMs the intact 435 GB before any forward. The runner chunks decoder layers into VRAM, dequants NVFP4 → BF16 in place, runs the GlmMoeDsaNaiveMoe modules (explicit per-expert outputs — the ideal saliency hook), captures S_j = mean_{active tokens}(router_gate_j · ||expert_output_j||₂), frees the chunk. Real GPU saliency over 7,368,253 active tokens across 75 MoE layers — no static proxy.
  4. Self-consistent prune at prune time — kept experts renumbered contiguous 0…171, router shrunk to [172, 6144], bias to [172], n_routed_experts = num_experts = 172. Loads clean on stock vLLM with no repair_reap.py.
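
The saliency formula in step 3 and the contiguous renumbering in step 4 can be sketched on toy data. This is a minimal NumPy illustration, not the actual runner: the function names `reap_saliency` and `prune_router`, the array layouts, and the choice to keep surviving experts in their original order are all assumptions made for the example.

```python
import numpy as np


def reap_saliency(gates: np.ndarray, out_norms: np.ndarray) -> np.ndarray:
    """S_j = mean over tokens where expert j is active of gate_j * ||expert_output_j||_2.

    gates:     [tokens, experts] router gate weights, 0 where the expert was not selected.
    out_norms: [tokens, experts] L2 norm of each expert's output for that token.
    (Both layouts are assumptions for this sketch.)
    """
    active = gates > 0
    weighted = np.where(active, gates * out_norms, 0.0)
    counts = active.sum(axis=0)
    totals = weighted.sum(axis=0)
    # Experts that never fire get saliency 0 instead of a divide-by-zero.
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)


def prune_router(router_w, router_bias, saliency, keep):
    """Keep the `keep` most salient experts and renumber them 0..keep-1.

    Survivors stay in their original relative order (an assumption).
    router_w is [experts, hidden]; router_bias is [experts].
    """
    top = np.argsort(-saliency, kind="stable")[:keep]
    kept = np.sort(top)
    old_to_new = {int(old): new for new, old in enumerate(kept)}
    return router_w[kept], router_bias[kept], old_to_new


# Toy layer: 3 tokens, 3 experts, hidden size 4.
gates = np.array([[0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.5]])
out_norms = np.array([[1.0, 4.0, 9.0],
                      [1.0, 3.0, 9.0],
                      [1.0, 1.0, 2.0]])
router_w = np.arange(12, dtype=np.float64).reshape(3, 4)
router_b = np.array([0.1, 0.2, 0.3])

saliency = reap_saliency(gates, out_norms)
new_w, new_b, old_to_new = prune_router(router_w, router_b, saliency, keep=2)
print("saliency:", saliency)
print("old -> new expert ids:", old_to_new)
print("router shape after prune:", new_w.shape)
```

In the checkpoint itself the same shrink takes the router from 256 rows to [172, 6144] and the bias to [172], with n_routed_experts and num_experts set to match, which is why no post-hoc repair step is needed at load time.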

Quick start

# Pull serving image
docker pull verdictai/glm52-nvfp4-dcpmtp:v3.3

# Download model (294 GB)
huggingface-cli download brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 \
  --local-dir $HOME/models/GLM-5.2-NVFP4-REAP-Recall-N172

# Serve on 4x RTX PRO 6000 96GB (sm120)
docker run -d --name glm52-reap-recall \
  --gpus all --runtime nvidia --ipc host --shm-size 32g --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/models":/models-archive:ro -v "$HOME/.cache/glm52-b12x":/cache \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 -e CUDA_DEVICE_ORDER=PCI_BUS_ID -e CUTE_DSL_ARCH=sm_120a \
  -e HF_HUB_OFFLINE=1 -e NCCL_IB_DISABLE=1 -e NCCL_P2P_LEVEL=SYS -e NCCL_PROTO=LL,LL128,Simple \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_AOT_COMPILE=1 -e VLLM_USE_BREAKABLE_CUDAGRAPH=0 -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -e B12X_MHC_MAX_TOKENS=16384 -e VLLM_USE_B12X_WO_PROJECTION=1 -e VLLM_USE_B12X_MHC=1 \
  -e VLLM_USE_B12X_FP8_GEMM=1 -e VLLM_USE_B12X_MOE=1 -e VLLM_USE_B12X_SPARSE_INDEXER=1 \
  -e VLLM_USE_V2_MODEL_RUNNER=1 -e VLLM_USE_FUSED_MOE_GROUPED_TOPK=1 \
  -e VLLM_PCIE_ALLREDUCE_BACKEND=b12x -e VLLM_ENABLE_PCIE_ALLREDUCE=1 \
  -e B12X_MLA_SM120_UNIFIED=1 -e USES_B12X=True -e B12X_DENSE_SPLITK_TURBO=1 -e B12X_W4A16_TC_DECODE=1 \
  -e B12X_MOE_FORCE_A16=1 -e VLLM_DCP_GLOBAL_TOPK=1 -e VLLM_DCP_SHARD_DRAFT=1 \
  verdictai/glm52-nvfp4-dcpmtp:v3.3 \
  python -m vllm.entrypoints.cli.main serve /models-archive/GLM-5.2-NVFP4-REAP-Recall-N172 \
    --served-model-name glm-5.2-nvfp4 --host 0.0.0.0 --port 9405 \
    --kv-cache-dtype fp8 --block-size 256 --load-format safetensors \
    --tensor-parallel-size 4 --decode-context-parallel-size 4 --moe-backend b12x --linear-backend auto \
    --gpu-memory-utilization 0.92 --max-model-len 200000 --max-num-seqs 16 \
    --enable-chunked-prefill --enable-prefix-caching --max-num-batched-tokens 8192 \
    --max-cudagraph-capture-size 64 --attention-backend B12X_MLA_SPARSE \
    --compilation-config '{"custom_ops":["all"],"cudagraph_mode":"PIECEWISE"}' \
    --enable-flashinfer-autotune \
    --hf-overrides '{"use_index_cache":true,"index_topk_pattern":"FFFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSS"}' \
    --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5,"draft_sample_method":"probabilistic","moe_backend":"b12x","use_local_argmax_reduction":true}'

# Sanity check
curl -s http://127.0.0.1:9405/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"glm-5.2-nvfp4","messages":[{"role":"user","content":"what is the capital of kentucky?"}]}'

Optional — thinking_token_budget (hard-caps the reasoning loop): mount the four V2 patch files from the repo into the container and add --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}', then pass "thinking_token_budget": N in the request body. Without those patches, leave --reasoning-config off — on a plain image it clobbers glm45's <think> priming and the chat path stops thinking.


Verified serving config (4× RTX PRO 6000 96GB, TP4)

The v3.3 serving config (image verdictai/glm52-nvfp4-dcpmtp:v3.3):

| Knob | Value | Notes |
|---|---|---|
| Tensor parallel | 4 | 4× 96 GB |
| Decode context parallel | 4 | the >300K KV path on TP4 |
| GPU memory util | 0.92 | DCP4 + MTP headroom (GPU1 also carries ~6.7 GB display) |
| Max model len | 200,000 | MTP + DCP4 on fp8; raise with no-MTP |
| Max num batched tokens | 8192 | vLLM warns 2048 is suboptimal; 8192 fits the KV pool |
| KV cache dtype | fp8 | fp8 MLA KV on the b12x sparse path |
| MTP num_speculative_tokens | 5 | GLM-5.2 was trained for 5-token MTP (official recipe) |
| B12X_MOE_FORCE_A16=1 | required | w4a4 accumulates error past ~1–2K gen tokens |
| VLLM_DCP_GLOBAL_TOPK=1 | required | remaps each shard's local top-k to true global selection (DCP > 1) |
| VLLM_DCP_SHARD_DRAFT=1 | recommended | shards the MTP draft KV across DCP ranks instead of replicating |
| -cc.cudagraph_mode=PIECEWISE | required for long context | CuTe-DSL JIT inside FULL cudagraph capture deadlocks at first decode after long prefill (>100K). PIECEWISE breaks the graph at the indexer so JIT happens outside capture. Must use the CLI shortcut form (JSON drops to None). |
| Reasoning parser | glm45 | leave --reasoning-config off unless using thinking_token_budget (see above) |
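
To illustrate why VLLM_DCP_GLOBAL_TOPK matters when DCP > 1, here is a minimal sketch of merging per-shard top-k results into a true global top-k. It is not the vLLM or b12x implementation: the function name `global_topk_from_shards` is hypothetical, and it assumes each shard holds a contiguous slice of the score vector, which may not match how DCP actually lays out the KV.

```python
import numpy as np


def global_topk_from_shards(shard_scores, k):
    """Merge per-shard top-k candidates into the global top-k indices.

    shard_scores: list of 1-D arrays, assumed to be contiguous slices
    of one global score vector (an assumption for this sketch).
    """
    cand_scores, cand_idx = [], []
    offset = 0
    for scores in shard_scores:
        local_k = min(k, len(scores))
        local = np.argsort(-scores, kind="stable")[:local_k]
        cand_scores.append(scores[local])
        cand_idx.append(local + offset)  # local index -> global index
        offset += len(scores)
    cand_scores = np.concatenate(cand_scores)
    cand_idx = np.concatenate(cand_idx)
    # The global top-k is always contained in the union of local top-k sets.
    best = np.argsort(-cand_scores, kind="stable")[:k]
    return np.sort(cand_idx[best])


scores = np.array([0.1, 0.9, 0.3, 0.8, 0.7, 0.2, 0.6, 0.05])
shards = [scores[:4], scores[4:]]

merged = global_topk_from_shards(shards, k=2)
reference = np.sort(np.argsort(-scores, kind="stable")[:2])
print("merged:", merged, "reference:", reference)
```

In this toy case both winners live on the first shard, so treating each shard's local picks as final would select the wrong entries; the merge step recovers the same indices as a single-device top-k.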

Benchmarks (measured on 4× RTX PRO 6000 96GB)

Note: the numbers below were measured on the prior nvfp4_ds_mla (4-bit MLA KV) config. A v3.3 (fp8) re-bench is pending.

Benchmarked with local-inference-lab/llm-inference-bench, concurrency × context grid. 15/15 cells completed.

Decode throughput (tok/s)

| ctx | C=1 | C=2 | C=4 | agg @ C=4 |
|---|---|---|---|---|
| 0 | 80.7 | 45.4 | 40.4 | 161.8 |
| 16K | 65.6 | 46.3 | 38.2 | 152.7 |
| 32K | 52.1 | 43.2 | 34.2 | 136.7 |
| 64K | 50.9 | 36.9 | 34.1 | 136.2 |
| 128K | 82.6 | 63.0 | 54.7 | 218.6 |

Real-prompt single-user (Marbury essay × 5, thinking OFF)

| run | tokens | TTFT | decode tok/s |
|---|---|---|---|
| 1 | 1500 | 0.13s | 58.94 |
| 2 | 1500 | 1.06s | 59.93 |
| 3 | 1500 | 0.14s | 57.52 |
| 4 | 1500 | 0.15s | 57.17 |
| 5 | 1500 | 0.15s | 58.70 |
| avg | | 0.33s | 58.45 |

Prefill (full ingest)

| ctx | tokens | TTFT | tok/s |
|---|---|---|---|
| 8K | 8,199 | 4.66s | 1,761 |
| 16K | 16,228 | 10.87s | 1,493 |
| 32K | 32,321 | 20.43s | 1,582 |
| 64K | 64,513 | 42.27s | 1,526 |
| 128K | 128,887 | 87.26s | 1,477 |
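
The prefill tok/s column appears to be prompt tokens divided by TTFT. The short check below recomputes it from the table values; the helper name `prefill_tok_per_s` is made up for this example, and small differences (the 8K row in particular) are expected because TTFT is only reported to two decimals.

```python
# (ctx label, prompt tokens, TTFT in seconds, reported tok/s) copied from the table above.
ROWS = [
    ("8K", 8199, 4.66, 1761),
    ("16K", 16228, 10.87, 1493),
    ("32K", 32321, 20.43, 1582),
    ("64K", 64513, 42.27, 1526),
    ("128K", 128887, 87.26, 1477),
]


def prefill_tok_per_s(tokens: int, ttft_s: float) -> float:
    """Prefill throughput, assuming TTFT covers the full prompt ingest."""
    return tokens / ttft_s


for ctx, tokens, ttft_s, reported in ROWS:
    derived = prefill_tok_per_s(tokens, ttft_s)
    print(f"{ctx:>5}: derived {derived:8.1f} tok/s, reported {reported}")
```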

KV pool: 542,857 tokens · VRAM peak: 97.93%


Files

  • 87 safetensors shards (294 GB), config.json, model.safetensors.index.json, tokenizer files, generation_config.json, chat_template.jinja
  • reap_recall_keep_map_with_scores.json — per-expert real saliency scores and the kept-expert map (the replication artifact)
  • REAP_RECALL_VERDICT.md — full corpus / saliency / validation ledger
  • serve_glm52_reap_recall.sh — verified launch script

Sampling

{
  "temperature": 1.0,
  "top_p": 0.95,
  "repetition_penalty": 1.05,
  "stop_token_ids": [154820, 154827, 154829],
  "chat_template_kwargs": {"enable_thinking": true, "reasoning_effort": "high"}
}

For non-thinking generation, set enable_thinking: false. For Marbury-style essays, max_tokens >= 1500 is recommended (thinking-high needs room).
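
As a rough sketch, the sampling block can be attached to a chat request like this. The endpoint and served model name come from the quick start; the helper names `build_payload` and `send_chat` are hypothetical, and passing `repetition_penalty`, `stop_token_ids`, and `chat_template_kwargs` at the top level of the body assumes the server accepts these vLLM extensions to the OpenAI-compatible API.

```python
import json
import urllib.request

URL = "http://127.0.0.1:9405/v1/chat/completions"  # port from the quick start
MODEL = "glm-5.2-nvfp4"                            # --served-model-name from the quick start

SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "repetition_penalty": 1.05,
    "stop_token_ids": [154820, 154827, 154829],
}


def build_payload(prompt: str, thinking: bool = True, max_tokens: int = 1500) -> dict:
    """Build a chat-completions request body using the card's sampling settings."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **SAMPLING,
        "chat_template_kwargs": {
            "enable_thinking": thinking,
            "reasoning_effort": "high",
        },
    }


def send_chat(payload: dict, url: str = URL, timeout_s: float = 600.0) -> dict:
    """POST the payload to a running server and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


# Only the payload is built here; nothing is sent until send_chat() is called.
payload = build_payload("What is the capital of Kentucky?", thinking=False)
print(json.dumps(payload, indent=2))
```

With the server from the quick start running, `send_chat(payload)` should return the completion as a dict.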


License

MIT. Attribution to the sources is the only ask:

  • z.ai (GLM-5.2 base)
  • Luke Alonso (NVFP4 quantization of GLM-5.2)
  • Cerebras Research (REAP method, arXiv:2510.13999)

@inproceedings{lasby2026reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=ukGxWd2aDG}
}

Model size: 296B params · Tensor types (safetensors): BF16, F8_E4M3, U8, F32