Support this work → · X · GitHub · REAP paper · Cerebras REAP

DeepSeek-V3.2-508B-NVFP4

NVFP4 quantization of cerebras/DeepSeek-V3.2-REAP-508B-A37B.

At a glance

Base model cerebras/DeepSeek-V3.2-REAP-508B-A37B
Format NVFP4
Total params 508B
Active / token 37B
Experts / layer 192
Layers 61
Hidden size 7168
Context 163,840
On-disk size 288 GB

Which variant should I pick?

Variant Format Link
DeepSeek-V3.2-345B-W3A16 W3A16 link
DeepSeek-V3.2-508B-NVFP4 (this) NVFP4 link

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper💻 Code

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).

This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.

📋 Model Specifications

Property Value
Base Model cerebras/DeepSeek-V3.2-REAP-508B-A37B (REAP-pruned from deepseek-ai/DeepSeek-V3.2)
Architecture DeepseekV3ForCausalLM (MoE with MLA)
Params 508B total, ~37B active per token (top-8 of 384 routed + 1 shared)
Base precision BF16 (source: ~1.0 TB)
Quantization NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size 288 GB (~3.6x compression)
Experts per MoE layer 384 routed + 1 shared
Layers 61
Hidden size 7168
Format nvfp4-pack-quantized via compressed-tensors

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_SPEC_V2=True \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
    --served-model-name deepseek-v32-reap-nvfp4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --moe-runner-backend b12x \
    --cuda-graph-max-bs 32 \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 --port 5000 \
    --disable-custom-all-reduce

Critical flags:

  • --kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
  • --attention-backend flashinfer — sm120-compatible
  • --quantization modelopt_fp4 — sglang's NVFP4 loader for compressed-tensors format
  • SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 288 GB weights + KV cache fits on 8x 96GB (≈768 GB total VRAM).

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.


🔬 Quantization Method

Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.

Settings

Setting Value Notes
--scheme NVFP4 4-bit weights + FP8 per-group scales
--iters 200 Full tuning (same hyperparameter as GPTQ variant)
--nsamples 512 Calibration samples
--seqlen 2048 Default
--batch_size 8 Default
--low_gpu_mem_usage true Required for ~1TB source on 640GB VRAM
--group_size 16 Matches NVFP4 native 16-element block scale
--format auto_round:llm_compressor Produces compressed-tensors (sglang/vLLM compatible)
--disable_amp true Avoids autocast issues on BF16 source

Calibration Dataset

Source Samples Content
NeelNanda/pile-10k 512 General web text (distribution anchor)

Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.

Wall Time

  • Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
  • Packing + save: ~7 min (58 safetensors shards, 288 GB)
  • Total: ~19.7 hours on 8x H100 80GB

Quality Characteristics

Layer-level loss trajectory (iter 0 → final):

Layer depth iter 0 loss final loss Behavior
0-10 1e-6 to 1e-2 50-80% reduction Early layers, minimal drift
11-30 1e-2 to 1e-1 30-50% reduction Sign-tuning active
31-50 1e-1 to 5e-1 20-30% reduction Accumulating
51-60 5e-1 to 1.9 10-20% reduction Deep-layer drift (layer 60: 1.86 → 1.50)

Weight-validity check (CPU dequant, pre-upload):

  • Cosine similarity vs BF16 source: 0.995+ across all tested layers
  • Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)

📊 Benchmarks

Pending. Run on 8x RTX PRO 6000 sm120 and report:

Task Score Notes
MMLU (5-shot)
GSM8K
MATH
HumanEval
IFEval strict

Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):

  • MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
  • GSM8K: 78-88% (BF16 base ~80-90%)
  • Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4

🧾 Provenance

Step Details
Source model cerebras/DeepSeek-V3.2-REAP-508B-A37B (BF16, ~1.0 TB, 96 safetensors)
Pruning REAP (Relative Expert Activation Pruning) — 384 → 384 experts (structure preserved, 508B is a REAP variant)
Quantization compute Nebius H100x8 via brev
Quant tool Intel AutoRound 0.12.2
Deploy tool voipmonitor/sglang:cu130
Upload date 2026-04-21

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Downloads last month
66
Safetensors
Model size
2B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/DeepSeek-V3.2-508B-NVFP4

Collection including 0xSero/DeepSeek-V3.2-508B-NVFP4

Paper for 0xSero/DeepSeek-V3.2-508B-NVFP4