festr2/glm51-nvfp4-w4a16-fp8pbwo-l51-62-20260517

GLM-5.1 mixed-precision checkpoint for vLLM experiments.

Base checkpoint: lukealonso/GLM-5.1-NVFP4
Selected MoE expert layers are replaced with FP8 blockwise weight-only (FP8_PB_WO) tensors from zai-org/GLM-5.1-FP8.
The remaining MoE expert layers stay in NVFP4 and are intended to run with the B12X NVFP4 backend.
Selected FP8_PB_WO layers: 51-62
Quantization mode in config.json: modelopt / MIXED_PRECISION
Expected runtime mode used in validation: W4A16 (B12X_MOE_FORCE_A16=1)
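
The layer split described above can be sketched as a small lookup. This is only an illustration: the helper name `expert_format` and the tensor-name pattern (`layers.<index>.` together with `.experts.`) are assumptions for the example, not names confirmed by this checkpoint or by vLLM. The range 51-62 is read as inclusive, matching the repository name.

```python
import re

# Layers 51-62 (inclusive) carry FP8_PB_WO expert weights, per the card.
FP8_LAYERS = range(51, 63)

# Assumed naming scheme: "...layers.<index>....experts...." (hypothetical).
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def expert_format(tensor_name):
    """Return the expected expert weight format for a tensor name.

    Returns None for tensors that are not MoE expert weights under the
    assumed naming scheme.
    """
    match = _LAYER_RE.search(tensor_name)
    if match is None or ".experts." not in tensor_name:
        return None
    layer = int(match.group(1))
    return "FP8_PB_WO" if layer in FP8_LAYERS else "NVFP4"
```

A check like this could be run over the tensor names in a checkpoint index to confirm that only the intended layers were swapped.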

Validation summary against FP8 reference logits:

windows	KLD
8	0.049443
42	0.056918
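
For readers unfamiliar with the metric, here is a minimal sketch of a per-token KL divergence computed from two sets of logits and averaged over positions. It assumes the direction KL(reference || test) and a plain mean over tokens; the actual KLD runner used for the numbers above may window, weight, or aggregate differently. All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || Q_test) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_rows, test_rows):
    """Average the per-position KL divergence over all positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)
```

Under this definition, identical logits give a KLD of zero, and adding a constant to every logit in a row leaves the result unchanged.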

Runtime footprint observed in vLLM KLD runner:

model memory / rank	KV cache tokens
45.30 GiB	679872

Canonical vLLM support commit: 61eac2779 Support ModelOpt mixed FP8_PB_WO MoE

Safetensors
Model size: 488B params
Tensor types: F32, BF16, F8_E4M3
