GLM-5.2-NVFP4-REAP-Recall (N=172)

A recall-recovered re-REAP of GLM-5.2-NVFP4 — 172 experts/layer (from 256), self-consistent, BF16 weights preserved, NVFP4 experts sliced byte-for-byte.

Recipe + serving stack: github.com/brandonmmusic-max/GLM-5.2-Reap
Docker image: verdictai/glm52-nvfp4-dcpmtp:v3.3
Saliency methodology + corpus ledger: REAP_RECALL_VERDICT.md
Verified launch script: serve_glm52_reap_recall.sh

Attribution: GLM-5.2 by z.ai · NVFP4 quantization by Luke Alonso · REAP by Cerebras Research (arXiv:2510.13999, ICLR 2026).

Why this exists

Standard REAP compresses GLM-5.2 well for code, agentic, and reasoning workloads — but closed-book factual recall collapses: the published narrow-calibrated REAPs answer Lexington for the capital of Kentucky and loop on Marbury v. Madison. The cause is not the model and not the serving stack — it's a calibration corpus that excludes knowledge. This checkpoint re-runs the REAP saliency on a knowledge- and legal-inclusive, axis-balanced calibration.

Prompt	Narrow-REAP baseline	This checkpoint (N=172)
What is the capital of Kentucky?	Lexington	Frankfort
In one sentence, what did Marbury v. Madison establish?	(empty / repetition loop)	judicial review — the Supreme Court's authority to declare laws unconstitutional
What is the capital of Texas?	—	Austin

All reasoning prompts still pass (8/8 traps; pipes/syllogism/discount/anticipatory-repudiation all finish=stop).

What I did specifically

- Diagnosed it as calibration, not the model. A/B'd kernels, MTP, DCP, image, and sampling. With the same prompts, the same parser, and the same image, behavior tracked the calibration corpus and nothing else.
- Built a 4-axis balanced calibration: 12,228 samples, 3,057 per axis, max 16,384 tokens, no truncation, no packing:
  - Axis 1, general knowledge: C4, Wikipedia, MMLU-aux, TriviaQA, Natural Questions
  - Axis 2, legal: 1,528 CAP markdown cases + a live Neo4j legal KG (300 headnotes, 390 statutes, 373 case summaries, 113 fact-atoms, 353 worked-examples). fallback_used: false.
  - Axis 3, code/agentic: evol-codealpaca, Magicoder, xLAM, SWE-smith
  - Axis 4, reasoning/termination: terminating <think>…</think> traces, </think>-region weighted ×6
- Built a real block-wise NVFP4 saliency runner. The published REAP loader can't consume glm_moe_dsa + modelopt NVFP4, and a whole-model vLLM load OOMs on the intact 435 GB checkpoint before any forward pass. The runner chunks decoder layers into VRAM, dequantizes NVFP4 → BF16 in place, runs the GlmMoeDsaNaiveMoe modules (explicit per-expert outputs, the ideal saliency hook), captures S_j = mean_{active tokens}(router_gate_j · ||expert_output_j||₂), and frees the chunk. This is real GPU saliency over 7,368,253 active tokens across 75 MoE layers, with no static proxy.
- Self-consistent prune at prune time. Kept experts are renumbered contiguously 0…171, the router is shrunk to [172, 6144], the bias to [172], and n_routed_experts = num_experts = 172. Loads clean on stock vLLM with no repair_reap.py.
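
The saliency score and the self-consistent prune described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the actual runner: the function names, the flat `(expert_id, gate, output)` record format, and the choice to keep the top-scoring experts in their original order are all hypothetical. The real scores and kept-expert map are in reap_recall_keep_map_with_scores.json.

```python
from math import sqrt


def l2_norm(vec):
    """Euclidean norm of one expert output vector."""
    return sqrt(sum(x * x for x in vec))


def expert_saliency(records, num_experts):
    """S_j = mean over tokens routed to expert j of gate_j * ||output_j||_2.

    `records` holds one (expert_id, gate, output_vector) tuple per
    (token, active expert) pair. Experts that were never active score 0.0.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate, output in records:
        totals[expert_id] += gate * l2_norm(output)
        counts[expert_id] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def prune_and_renumber(saliency, router_rows, router_bias, keep):
    """Keep the `keep` highest-saliency experts and renumber them 0..keep-1.

    Assumption: kept experts stay in their original relative order, and the
    router weight rows and bias entries are sliced with the same index list
    so the checkpoint remains self-consistent.
    """
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    kept = sorted(ranked[:keep])
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_rows = [router_rows[old] for old in kept]
    new_bias = [router_bias[old] for old in kept]
    return kept, old_to_new, new_rows, new_bias


# Toy layer: 4 experts, hidden size 2, keep 2 (the checkpoint keeps 172 of 256).
records = [
    (0, 0.5, [3.0, 4.0]),
    (0, 0.5, [0.0, 0.0]),
    (2, 1.0, [6.0, 8.0]),
    (3, 0.5, [0.0, 2.0]),
]
saliency = expert_saliency(records, num_experts=4)
router_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
router_bias = [1.0, 2.0, 3.0, 4.0]
kept, old_to_new, new_rows, new_bias = prune_and_renumber(
    saliency, router_rows, router_bias, keep=2
)
print(saliency, kept, old_to_new)
```

In the toy layer, expert 1 is never routed to and expert 3 scores lowest among the active ones, so both are dropped and the survivors are remapped to indices 0 and 1 with matching router rows and bias entries.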

Quick start

# Pull serving image
docker pull verdictai/glm52-nvfp4-dcpmtp:v3.3

# Download model (294 GB)
huggingface-cli download brandonmusic/GLM-5.2-NVFP4-REAP-Recall-N172 \
  --local-dir "$HOME/models/GLM-5.2-NVFP4-REAP-Recall-N172"

# Serve on 4x RTX PRO 6000 96GB (sm120)
docker run -d --name glm52-reap-recall \
  --gpus all --runtime nvidia --ipc host --shm-size 32g --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/models":/models-archive:ro -v "$HOME/.cache/glm52-b12x":/cache \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 -e CUDA_DEVICE_ORDER=PCI_BUS_ID -e CUTE_DSL_ARCH=sm_120a \
  -e HF_HUB_OFFLINE=1 -e NCCL_IB_DISABLE=1 -e NCCL_P2P_LEVEL=SYS -e NCCL_PROTO=LL,LL128,Simple \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_AOT_COMPILE=1 -e VLLM_USE_BREAKABLE_CUDAGRAPH=0 -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -e B12X_MHC_MAX_TOKENS=16384 -e VLLM_USE_B12X_WO_PROJECTION=1 -e VLLM_USE_B12X_MHC=1 \
  -e VLLM_USE_B12X_FP8_GEMM=1 -e VLLM_USE_B12X_MOE=1 -e VLLM_USE_B12X_SPARSE_INDEXER=1 \
  -e VLLM_USE_V2_MODEL_RUNNER=1 -e VLLM_USE_FUSED_MOE_GROUPED_TOPK=1 \
  -e VLLM_PCIE_ALLREDUCE_BACKEND=b12x -e VLLM_ENABLE_PCIE_ALLREDUCE=1 \
  -e B12X_MLA_SM120_UNIFIED=1 -e USES_B12X=True -e B12X_DENSE_SPLITK_TURBO=1 -e B12X_W4A16_TC_DECODE=1 \
  -e B12X_MOE_FORCE_A16=1 -e VLLM_DCP_GLOBAL_TOPK=1 -e VLLM_DCP_SHARD_DRAFT=1 \
  verdictai/glm52-nvfp4-dcpmtp:v3.3 \
  python -m vllm.entrypoints.cli.main serve /models-archive/GLM-5.2-NVFP4-REAP-Recall-N172 \
    --served-model-name glm-5.2-nvfp4 --host 0.0.0.0 --port 9405 \
    --kv-cache-dtype fp8 --block-size 256 --load-format safetensors \
    --tensor-parallel-size 4 --decode-context-parallel-size 4 --moe-backend b12x --linear-backend auto \
    --gpu-memory-utilization 0.92 --max-model-len 200000 --max-num-seqs 16 \
    --enable-chunked-prefill --enable-prefix-caching --max-num-batched-tokens 8192 \
    --max-cudagraph-capture-size 64 --attention-backend B12X_MLA_SPARSE \
    --compilation-config '{"custom_ops":["all"],"cudagraph_mode":"PIECEWISE"}' \
    --enable-flashinfer-autotune \
    --hf-overrides '{"use_index_cache":true,"index_topk_pattern":"FFFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSSFSSS"}' \
    --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5,"draft_sample_method":"probabilistic","moe_backend":"b12x","use_local_argmax_reduction":true}'

# Sanity check
curl -s http://127.0.0.1:9405/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"glm-5.2-nvfp4","messages":[{"role":"user","content":"what is the capital of kentucky?"}]}'
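
The same sanity check can be scripted. The sketch below uses only the Python standard library and assumes the port (9405) and served model name (glm-5.2-nvfp4) from the serve command above. The helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9405/v1"  # --port from the serve command
MODEL = "glm-5.2-nvfp4"                # --served-model-name from the serve command


def build_request(prompt):
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the prompt and return (reply text, finish_reason) of the first choice."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    choice = body["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]
```

With the container running, `ask("what is the capital of kentucky?")` returns the reply text together with its finish reason, which makes it easy to check both the recall answer and that generation ended with stop.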

Optional — thinking_token_budget (hard-caps the reasoning loop): mount the four V2 patch files from the repo into the container and add --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}', then pass "thinking_token_budget": N in the request body. Without those patches, leave --reasoning-config off — on a plain image it clobbers glm45's <think> priming and the chat path stops thinking.

Verified serving config (4× RTX PRO 6000 96GB, TP4)

The v3.3 serving config (image verdictai/glm52-nvfp4-dcpmtp:v3.3):

Knob	Value	Notes
Tensor parallel	4	4× 96 GB
Decode context parallel	4	the >300K KV path on TP4
GPU memory util	0.92	DCP4 + MTP headroom (GPU1 also carries ~6.7 GB display)
Max model len	200,000	MTP + DCP4 on fp8; raise with no-MTP
Max num batched tokens	8192	vLLM warns 2048 is suboptimal; 8192 fits the KV pool
KV cache dtype	fp8	fp8 MLA KV on the b12x sparse path
MTP num_speculative_tokens	5	GLM-5.2 was trained for 5-token MTP (official recipe)
`B12X_MOE_FORCE_A16=1`	required	w4a4 accumulates error past ~1–2K gen tokens
`VLLM_DCP_GLOBAL_TOPK=1`	required	remaps each shard's local top-k to true global selection (DCP > 1)
`VLLM_DCP_SHARD_DRAFT=1`	recommended	shards the MTP draft KV across DCP ranks instead of replicating
`-cc.cudagraph_mode=PIECEWISE`	required for long context	CuTe-DSL JIT inside FULL cudagraph capture deadlocks at first decode after long prefill (>100K). PIECEWISE breaks the graph at the indexer so JIT happens outside capture. Must use the CLI shortcut form (JSON drops to None).
Reasoning parser	`glm45`	leave `--reasoning-config` off unless using `thinking_token_budget` (see above)

Benchmarks (measured on 4× RTX PRO 6000 96GB)

Note: the numbers below were measured on the prior nvfp4_ds_mla (4-bit MLA KV) config. A v3.3 (fp8) re-bench is pending.

Benchmarked with local-inference-lab/llm-inference-bench, concurrency × context grid. 15/15 cells completed.

Decode throughput (tok/s)

ctx	C=1	C=2	C=4	agg @ C=4
0	80.7	45.4	40.4	161.8
16K	65.6	46.3	38.2	152.7
32K	52.1	43.2	34.2	136.7
64K	50.9	36.9	34.1	136.2
128K	82.6	63.0	54.7	218.6
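
A quick consistency check on the decode table, reading the C=1, C=2, and C=4 columns as per-stream rates. That reading is an assumption, since the table does not label them. Under it, the aggregate at C=4 should be roughly four times the C=4 column.

```python
# Rows copied from the decode throughput table: ctx -> (C=1, C=2, C=4, agg @ C=4).
DECODE_TOK_S = {
    "0": (80.7, 45.4, 40.4, 161.8),
    "16K": (65.6, 46.3, 38.2, 152.7),
    "32K": (52.1, 43.2, 34.2, 136.7),
    "64K": (50.9, 36.9, 34.1, 136.2),
    "128K": (82.6, 63.0, 54.7, 218.6),
}


def aggregate_gap(row):
    """Absolute difference between 4 x per-stream C=4 rate and the reported aggregate."""
    _c1, _c2, c4, agg = row
    return abs(4 * c4 - agg)


max_gap = max(aggregate_gap(row) for row in DECODE_TOK_S.values())

for ctx, (c1, _c2, c4, agg) in DECODE_TOK_S.items():
    # Ratio of aggregate throughput at C=4 to the single-stream rate.
    print(f"ctx={ctx:>4}  4*C4={4 * c4:6.1f}  agg={agg:6.1f}  agg/C1={agg / c1:.2f}")
print(f"largest gap between 4*C4 and reported aggregate: {max_gap:.2f} tok/s")
```

The small residual gaps are consistent with rounding of the per-stream figures.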

Real-prompt single-user (Marbury essay × 5, thinking OFF)

run	tokens	TTFT	decode tok/s
1	1500	0.13s	58.94
2	1500	1.06s	59.93
3	1500	0.14s	57.52
4	1500	0.15s	57.17
5	1500	0.15s	58.70
avg		0.33s	58.45

Prefill (full ingest)

ctx	tokens	TTFT	tok/s
8K	8,199	4.66s	1,761
16K	16,228	10.87s	1,493
32K	32,321	20.43s	1,582
64K	64,513	42.27s	1,526
128K	128,887	87.26s	1,477
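
The tok/s column in the prefill table matches ingested tokens divided by TTFT. The short check below recomputes it from the other two columns. The variable and function names are illustrative only.

```python
# Rows copied from the prefill table: (ctx label, tokens, TTFT seconds, reported tok/s).
PREFILL = [
    ("8K", 8199, 4.66, 1761),
    ("16K", 16228, 10.87, 1493),
    ("32K", 32321, 20.43, 1582),
    ("64K", 64513, 42.27, 1526),
    ("128K", 128887, 87.26, 1477),
]


def prefill_rate(tokens, ttft_s):
    """Prefill throughput in tokens per second for a full-prompt ingest."""
    return tokens / ttft_s


# Relative error between the recomputed rate and the reported one, per row.
rel_errors = [
    abs(prefill_rate(tokens, ttft) - reported) / reported
    for _ctx, tokens, ttft, reported in PREFILL
]

for (ctx, tokens, ttft, reported), err in zip(PREFILL, rel_errors):
    print(f"{ctx:>4}: {prefill_rate(tokens, ttft):8.1f} tok/s vs reported {reported} ({err:.3%} off)")
```

Every row agrees with the reported figure to well under one percent, the remainder being rounding in the TTFT column.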

KV pool: 542,857 tokens · VRAM peak: 97.93%

Files

87 safetensors shards (294 GB), config.json, model.safetensors.index.json, tokenizer files, generation_config.json, chat_template.jinja
reap_recall_keep_map_with_scores.json — per-expert real saliency scores and the kept-expert map (the replication artifact)
REAP_RECALL_VERDICT.md — full corpus / saliency / validation ledger
serve_glm52_reap_recall.sh — verified launch script

Sampling

{
  "temperature": 1.0,
  "top_p": 0.95,
  "repetition_penalty": 1.05,
  "stop_token_ids": [154820, 154827, 154829],
  "chat_template_kwargs": {"enable_thinking": true, "reasoning_effort": "high"}
}

For non-thinking generation, set enable_thinking: false. For Marbury-style essays, set max_tokens to at least 1500 (thinking-high needs room).
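
A minimal sketch of a request body built from these defaults. It assumes the server accepts the sampling fields at the top level of the chat completions JSON, as vLLM's OpenAI-compatible server does for its extra sampling parameters. The helper name `chat_payload` is hypothetical. The optional `thinking_token_budget` field is only meaningful on a server started with the patches and --reasoning-config described in the Quick start notes.

```python
import copy
import json

# Recommended sampling defaults, copied from the block above.
SAMPLING_DEFAULTS = {
    "temperature": 1.0,
    "top_p": 0.95,
    "repetition_penalty": 1.05,
    "stop_token_ids": [154820, 154827, 154829],
    "chat_template_kwargs": {"enable_thinking": True, "reasoning_effort": "high"},
}


def chat_payload(prompt, thinking=True, max_tokens=1500,
                 thinking_token_budget=None, model="glm-5.2-nvfp4"):
    """Return a chat completions request body using the recommended sampling."""
    payload = copy.deepcopy(SAMPLING_DEFAULTS)  # never mutate the shared defaults
    payload["chat_template_kwargs"]["enable_thinking"] = thinking
    payload["model"] = model
    payload["messages"] = [{"role": "user", "content": prompt}]
    payload["max_tokens"] = max_tokens
    if thinking_token_budget is not None:
        # Only honored by a server running the thinking_token_budget patches.
        payload["thinking_token_budget"] = thinking_token_budget
    return payload


essay = chat_payload("In one sentence, what did Marbury v. Madison establish?")
quick = chat_payload("What is the capital of Texas?", thinking=False, max_tokens=64)
print(json.dumps(quick, indent=2))
```

Copying the defaults per request keeps a non-thinking call from silently flipping enable_thinking for later calls.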

License

MIT. Attribution to the sources is the only ask:

z.ai (GLM-5.2 base)
Luke Alonso (NVFP4 quantization of GLM-5.2)
Cerebras Research (REAP method, arXiv:2510.13999)

@inproceedings{lasby2026reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=ukGxWd2aDG}
}

Model size: 296B params (Safetensors)
Tensor types: BF16, F8_E4M3, F32
Base model: zai-org/GLM-5.2 (this model is a quantized derivative)
Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (arXiv:2510.13999, published Oct 15, 2025)