festr2/glm51-nvfp4-w4a16-fp8pbwo-l51-62-20260517

Mixed-precision GLM-5.1 checkpoint for vLLM experiments.

  • Base checkpoint: lukealonso/GLM-5.1-NVFP4
  • Selected MoE expert layers replaced with FP8 blockwise weight-only (FP8_PB_WO) tensors from zai-org/GLM-5.1-FP8.
  • Remaining MoE expert layers stay NVFP4 and are intended to run with the B12X NVFP4 backend.
  • Selected FP8_PB_WO layers: 51-62
  • Quantization mode in config.json: modelopt / MIXED_PRECISION
  • Runtime mode used in validation, and expected at inference: W4A16 (B12X_MOE_FORCE_A16=1)
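
The layer split and runtime flag listed above can be sketched in a few lines of Python. This is only an illustration: `expert_format` and `runtime_env` are hypothetical helper names, not part of vLLM or ModelOpt, and the sketch assumes the index passed in refers to an MoE layer. Only the layer range 51-62 and the `B12X_MOE_FORCE_A16=1` setting come from this card.

```python
# Hypothetical helpers for illustration; not a vLLM or ModelOpt API.

# Layers 51-62 inclusive carry FP8_PB_WO expert weights (per this card).
FP8_PB_WO_LAYERS = range(51, 63)


def expert_format(layer_idx: int) -> str:
    """Return the expert weight format for an MoE layer index.

    Assumes layer_idx refers to an MoE layer; dense layers are not covered.
    """
    return "FP8_PB_WO" if layer_idx in FP8_PB_WO_LAYERS else "NVFP4"


def runtime_env(base_env: dict) -> dict:
    """Return a copy of base_env with the W4A16 flag used in validation set."""
    env = dict(base_env)
    env["B12X_MOE_FORCE_A16"] = "1"
    return env
```

For example, `expert_format(55)` yields `"FP8_PB_WO"` while `expert_format(40)` yields `"NVFP4"`, and `runtime_env(os.environ)` gives an environment mapping that could be passed to whatever process launches vLLM.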

Validation summary: KL divergence (KLD) against FP8 reference logits:

windows   KLD
8         0.049443
42        0.056918
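
For readers unfamiliar with the metric, here is a minimal sketch of how a KLD between reference and test logits can be computed. It is not the KLD runner used for the numbers above: the direction KL(reference || test), the use of nats, and the plain mean over token positions are all assumptions, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(ref_logits, test_logits):
    """KL(ref || test) in nats for one token position (assumed direction)."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_rows, test_rows):
    """Average per-token KLD over paired rows of logits (assumed aggregation)."""
    values = [token_kld(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)
```

Identical logits give a KLD of zero, so lower values mean the mixed-precision checkpoint tracks the FP8 reference more closely.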

Runtime footprint observed in vLLM KLD runner:

model memory / rank   KV cache tokens
45.30 GiB             679872

Canonical vLLM support commit: 61eac2779 ("Support ModelOpt mixed FP8_PB_WO MoE")

Safetensors:

  • Model size: 488B params
  • Tensor types: F32, BF16, F8_E4M3, U8

Model tree for festr2/glm51-nvfp4-w4a16-fp8pbwo-l51-62-20260517:

  • Base model: zai-org/GLM-5.1
  • This model: quantized derivative of the base model