Instructions to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="0xSero/DeepSeek-V3.2-508B-NVFP4")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("0xSero/DeepSeek-V3.2-508B-NVFP4") model = AutoModelForCausalLM.from_pretrained("0xSero/DeepSeek-V3.2-508B-NVFP4") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "0xSero/DeepSeek-V3.2-508B-NVFP4" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "0xSero/DeepSeek-V3.2-508B-NVFP4", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/0xSero/DeepSeek-V3.2-508B-NVFP4
- SGLang
How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "0xSero/DeepSeek-V3.2-508B-NVFP4" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "0xSero/DeepSeek-V3.2-508B-NVFP4", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "0xSero/DeepSeek-V3.2-508B-NVFP4" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "0xSero/DeepSeek-V3.2-508B-NVFP4", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with Docker Model Runner:
docker model run hf.co/0xSero/DeepSeek-V3.2-508B-NVFP4
Support this work → · X · GitHub · REAP paper · Cerebras REAP
DeepSeek-V3.2-508B-NVFP4
NVFP4 quantization of cerebras/DeepSeek-V3.2-REAP-508B-A37B.
At a glance
| Base model | cerebras/DeepSeek-V3.2-REAP-508B-A37B |
| Format | NVFP4 |
| Total params | 508B |
| Active / token | 37B |
| Experts / layer | 192 |
| Layers | 61 |
| Hidden size | 7168 |
| Context | 163,840 |
| On-disk size | 288 GB |
Which variant should I pick?
𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
DeepSeek-V3.2-REAP-508B-A37B-NVFP4
REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).
This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.
📋 Model Specifications
| Property | Value |
|---|---|
| Base Model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (REAP-pruned from deepseek-ai/DeepSeek-V3.2) |
| Architecture | DeepseekV3ForCausalLM (MoE with MLA) |
| Params | 508B total, ~37B active per token (top-8 of 384 routed + 1 shared) |
| Base precision | BF16 (source: ~1.0 TB) |
| Quantization | NVFP4 (4-bit weights + FP8 per-group scales, group=16) |
| Output size | 288 GB (~3.6x compression) |
| Experts per MoE layer | 384 routed + 1 shared |
| Layers | 61 |
| Hidden size | 7168 |
| Format | nvfp4-pack-quantized via compressed-tensors |
🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)
Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.
docker run --gpus all --ipc=host --shm-size=8g --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v jit-cache:/cache/jit \
-e SGLANG_ENABLE_SPEC_V2=True \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_MIN_NCHANNELS=8 \
voipmonitor/sglang:cu130 \
python3 -m sglang.launch_server \
--model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
--served-model-name deepseek-v32-reap-nvfp4 \
--tensor-parallel-size 8 \
--quantization modelopt_fp4 \
--kv-cache-dtype bf16 \
--trust-remote-code \
--attention-backend flashinfer \
--moe-runner-backend b12x \
--cuda-graph-max-bs 32 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 5000 \
--disable-custom-all-reduce
Critical flags:
--kv-cache-dtype bf16— mandatory; fp8_e4m3 produces garbled output on sm120--attention-backend flashinfer— sm120-compatible--quantization modelopt_fp4— sglang's NVFP4 loader for compressed-tensors formatSGLANG_ENABLE_DEEP_GEMM=0— DeepGEMM needs WGMMA/TCGEN05 absent on sm120
Memory fit: 288 GB weights + KV cache fits on 8x 96GB (≈768 GB total VRAM).
Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
🔬 Quantization Method
Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.
Settings
| Setting | Value | Notes |
|---|---|---|
--scheme |
NVFP4 | 4-bit weights + FP8 per-group scales |
--iters |
200 | Full tuning (same hyperparameter as GPTQ variant) |
--nsamples |
512 | Calibration samples |
--seqlen |
2048 | Default |
--batch_size |
8 | Default |
--low_gpu_mem_usage |
true | Required for ~1TB source on 640GB VRAM |
--group_size |
16 | Matches NVFP4 native 16-element block scale |
--format |
auto_round:llm_compressor | Produces compressed-tensors (sglang/vLLM compatible) |
--disable_amp |
true | Avoids autocast issues on BF16 source |
Calibration Dataset
| Source | Samples | Content |
|---|---|---|
| NeelNanda/pile-10k | 512 | General web text (distribution anchor) |
Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.
Wall Time
- Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
- Packing + save: ~7 min (58 safetensors shards, 288 GB)
- Total: ~19.7 hours on 8x H100 80GB
Quality Characteristics
Layer-level loss trajectory (iter 0 → final):
| Layer depth | iter 0 loss | final loss | Behavior |
|---|---|---|---|
| 0-10 | 1e-6 to 1e-2 | 50-80% reduction | Early layers, minimal drift |
| 11-30 | 1e-2 to 1e-1 | 30-50% reduction | Sign-tuning active |
| 31-50 | 1e-1 to 5e-1 | 20-30% reduction | Accumulating |
| 51-60 | 5e-1 to 1.9 | 10-20% reduction | Deep-layer drift (layer 60: 1.86 → 1.50) |
Weight-validity check (CPU dequant, pre-upload):
- Cosine similarity vs BF16 source: 0.995+ across all tested layers
- Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)
📊 Benchmarks
Pending. Run on 8x RTX PRO 6000 sm120 and report:
| Task | Score | Notes |
|---|---|---|
| MMLU (5-shot) | — | |
| GSM8K | — | |
| MATH | — | |
| HumanEval | — | |
| IFEval strict | — |
Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):
- MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
- GSM8K: 78-88% (BF16 base ~80-90%)
- Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4
🧾 Provenance
| Step | Details |
|---|---|
| Source model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (BF16, ~1.0 TB, 96 safetensors) |
| Pruning | REAP (Relative Expert Activation Pruning) — 384 → 384 experts (structure preserved, 508B is a REAP variant) |
| Quantization compute | Nebius H100x8 via brev |
| Quant tool | Intel AutoRound 0.12.2 |
| Deploy tool | voipmonitor/sglang:cu130 |
| Upload date | 2026-04-21 |
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
- Downloads last month
- 66
Model tree for 0xSero/DeepSeek-V3.2-508B-NVFP4
Base model
deepseek-ai/DeepSeek-V3.2-Exp-Base