festr2/glm51-nvfp4-w4a16-fp8pbwo-l51-62-20260517
GLM-5.1 mixed precision checkpoint for vLLM experiments.
- Base checkpoint: `lukealonso/GLM-5.1-NVFP4`
- Selected MoE expert layers replaced with FP8 blockwise weight-only (`FP8_PB_WO`) tensors from `zai-org/GLM-5.1-FP8`.
- Remaining MoE expert layers stay NVFP4 and are intended to run with the B12X NVFP4 backend.
- Selected FP8_PB_WO layers: `51-62`
- Quantization mode in `config.json`: `modelopt/MIXED_PRECISION`
- Expected runtime mode used in validation: W4A16 (`B12X_MOE_FORCE_A16=1`)
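
The runtime mode above is selected through an environment variable, so a launch wrapper only needs to set it before starting vLLM. The following is a minimal sketch, not a tested recipe: the helper name `build_launch` and the tensor-parallel size of 8 are assumptions made for illustration, and the vLLM build is assumed to include the support commit listed on this card. It only prepares the environment and argument list; it does not start the server.

```python
import os
import shlex

MODEL_ID = "festr2/glm51-nvfp4-w4a16-fp8pbwo-l51-62-20260517"


def build_launch(model_id, tensor_parallel_size):
    """Return (env, argv) for a vLLM server launch in W4A16 mode.

    Hypothetical helper: the tensor-parallel size is not stated on the
    card and must be chosen for the hardware at hand.
    """
    env = dict(os.environ)
    # Runtime mode used in validation, as stated on the card.
    env["B12X_MOE_FORCE_A16"] = "1"
    argv = [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    return env, argv


# Assumed value of 8 ranks, for illustration only.
env, argv = build_launch(MODEL_ID, tensor_parallel_size=8)
print("B12X_MOE_FORCE_A16=" + env["B12X_MOE_FORCE_A16"], shlex.join(argv))
```

Passing `env` and `argv` to `subprocess.run(argv, env=env)` would start the server; keeping the variable in a per-process environment avoids forcing A16 mode on unrelated vLLM runs in the same shell.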
Validation summary against FP8 reference logits:
| windows | KLD |
|---|---|
| 8 | 0.049443 |
| 42 | 0.056918 |
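
For readers who want to compute a comparable number, the sketch below shows one common way to measure KL divergence between reference and test logits. The card does not say how the KLD runner defines the metric, so the direction (reference to test), the unit (nats), and the per-token averaging are all assumptions, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def mean_kld(ref_rows, test_rows):
    """Average per-token KLD over all positions (assumed aggregation)."""
    klds = [token_kld(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(klds) / len(klds)


# Toy example: two token positions over a 3-entry vocabulary.
ref = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
test = [[1.9, 0.6, -1.1], [0.1, 0.0, -0.1]]
print(f"mean KLD: {mean_kld(ref, test):.6f}")
```

With real models, `ref` would hold the FP8 reference logits and `test` the logits of this checkpoint over the same token windows.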
Runtime footprint observed in vLLM KLD runner:
| model memory / rank | KV cache tokens |
|---|---|
| 45.30 GiB | 679872 |
Canonical vLLM support commit:
61eac2779 Support ModelOpt mixed FP8_PB_WO MoE