Instructions to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/DeepSeek-V3.2-508B-NVFP4")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/DeepSeek-V3.2-508B-NVFP4")
model = AutoModelForCausalLM.from_pretrained("0xSero/DeepSeek-V3.2-508B-NVFP4")


vLLM

How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/DeepSeek-V3.2-508B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/DeepSeek-V3.2-508B-NVFP4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/0xSero/DeepSeek-V3.2-508B-NVFP4

SGLang

How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/DeepSeek-V3.2-508B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/DeepSeek-V3.2-508B-NVFP4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/DeepSeek-V3.2-508B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/DeepSeek-V3.2-508B-NVFP4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use 0xSero/DeepSeek-V3.2-508B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/0xSero/DeepSeek-V3.2-508B-NVFP4
```


DeepSeek-V3.2-508B-NVFP4

NVFP4 quantization of cerebras/DeepSeek-V3.2-REAP-508B-A37B.

At a glance


Base model	cerebras/DeepSeek-V3.2-REAP-508B-A37B
Format	NVFP4
Total params	508B
Active / token	37B
Experts / layer	192
Layers	61
Hidden size	7168
Context	163,840
On-disk size	288 GB

Which variant should I pick?

Variant	Format	Link
`DeepSeek-V3.2-345B-W3A16`	W3A16	link
`DeepSeek-V3.2-508B-NVFP4` (this)	NVFP4	link

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).

This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.

📋 Model Specifications

Property	Value
Base Model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (REAP-pruned from `deepseek-ai/DeepSeek-V3.2`)
Architecture	`DeepseekV3ForCausalLM` (MoE with MLA)
Params	508B total, ~37B active per token (top-8 of 192 routed + 1 shared)
Base precision	BF16 (source: ~1.0 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	288 GB (~3.6x compression)
Experts per MoE layer	192 routed + 1 shared
Layers	61
Hidden size	7168
Format	`nvfp4-pack-quantized` via `compressed-tensors`
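
The size and compression figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes every parameter is stored as a 4-bit value plus one 8-bit scale shared by each group of 16, and it ignores any layers kept in higher precision as well as per-tensor global scales, so it is an estimate rather than an exact accounting of the 288 GB checkpoint. The function names are illustrative only.

```python
def nvfp4_bits_per_weight(weight_bits: int = 4, scale_bits: int = 8, group_size: int = 16) -> float:
    """Effective storage bits per weight: the payload plus one shared scale per group."""
    return weight_bits + scale_bits / group_size


def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Checkpoint size in GB (1 GB = 1e9 bytes) if every parameter used this format."""
    return num_params * bits_per_weight / 8 / 1e9


bits = nvfp4_bits_per_weight()              # 4 + 8/16 bits per weight
nvfp4_gb = estimated_size_gb(508e9, bits)   # idealized NVFP4 size for 508B params
bf16_gb = estimated_size_gb(508e9, 16)      # BF16 source size for comparison
ratio = bf16_gb / nvfp4_gb

print(f"{bits} bits/weight, NVFP4 ~{nvfp4_gb:.0f} GB, BF16 ~{bf16_gb:.0f} GB, ratio ~{ratio:.2f}x")
```

Under these assumptions the estimate lands a few GB below the reported on-disk size, which is consistent with a small share of tensors staying in a wider format.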

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

This uses the pre-built voipmonitor/sglang:cu130 Docker image, which has all sm120 patches applied.

docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_SPEC_V2=True \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
    --served-model-name deepseek-v32-reap-nvfp4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --moe-runner-backend b12x \
    --cuda-graph-max-bs 32 \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 --port 5000 \
    --disable-custom-all-reduce

Critical flags:

--kv-cache-dtype bf16: mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer: sm120-compatible attention backend
--quantization modelopt_fp4: sglang's NVFP4 loader for the compressed-tensors format
SGLANG_ENABLE_DEEP_GEMM=0: DeepGEMM needs WGMMA/TCGEN05 instructions, which are absent on sm120

Memory fit: 288 GB of weights plus the KV cache fit on 8x 96 GB GPUs (768 GB total VRAM).
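
As a rough illustration of that headroom, the sketch below works out how much memory is left for the KV cache with the values used in the launch command. It assumes the weights shard evenly across the 8 tensor-parallel ranks and that `--mem-fraction-static 0.85` bounds weights plus KV cache pool together. It ignores activations, CUDA graphs, and the GB/GiB distinction, so treat the result as an upper bound rather than a measurement.

```python
NUM_GPUS = 8
VRAM_PER_GPU_GB = 96
WEIGHTS_GB = 288
MEM_FRACTION_STATIC = 0.85  # value passed to --mem-fraction-static in the launch command


def kv_budget_gb(num_gpus: int, vram_per_gpu: float, weights: float, fraction: float) -> tuple[float, float]:
    """Return (total, per-GPU) GB left for the KV cache pool after loading the weights."""
    static_pool = num_gpus * vram_per_gpu * fraction  # memory the server may claim statically
    total_left = static_pool - weights
    return total_left, total_left / num_gpus


total_kv, per_gpu_kv = kv_budget_gb(NUM_GPUS, VRAM_PER_GPU_GB, WEIGHTS_GB, MEM_FRACTION_STATIC)
print(f"KV cache budget: ~{total_kv:.1f} GB total, ~{per_gpu_kv:.1f} GB per GPU")
```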

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
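
Once the server from the deploy command is up, it can be queried from Python as well as with curl. The following is a minimal standard-library sketch that assumes the port (5000) and served model name (`deepseek-v32-reap-nvfp4`) from that command and the OpenAI-compatible `/v1/completions` route shown in the curl examples. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:5000",   # --port 5000 from the deploy command
    model: str = "deepseek-v32-reap-nvfp4",    # --served-model-name from the deploy command
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running locally, `complete("Once upon a time,")` returns the completion text. Pass `base_url=...` to target another host.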

🔬 Quantization Method

Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.

Settings

Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	200	Full tuning (same hyperparameter as GPTQ variant)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default
`--batch_size`	8	Default
`--low_gpu_mem_usage`	true	Required for ~1TB source on 640GB VRAM
`--group_size`	16	Matches NVFP4 native 16-element block scale
`--format`	auto_round:llm_compressor	Produces compressed-tensors (sglang/vLLM compatible)
`--disable_amp`	true	Avoids autocast issues on BF16 source
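
For reproducibility, the settings table can be turned into a command line. This sketch only assembles the argument list and does not run anything. The `auto-round` executable name, the `--model` and `--output_dir` flags, the output path, and the treatment of the two boolean settings as bare switches are assumptions to check against the installed AutoRound version; the remaining flags and values are copied from the table.

```python
import shlex

# Values copied from the settings table above.
SETTINGS = {
    "scheme": "NVFP4",
    "iters": 200,
    "nsamples": 512,
    "seqlen": 2048,
    "batch_size": 8,
    "low_gpu_mem_usage": True,
    "group_size": 16,
    "format": "auto_round:llm_compressor",
    "disable_amp": True,
}


def build_autoround_command(model: str, output_dir: str, settings: dict = SETTINGS) -> list[str]:
    """Assemble an argv list; booleans become bare switches (assumed CLI behavior)."""
    args = ["auto-round", "--model", model, "--output_dir", output_dir]  # assumed entry point and flags
    for name, value in settings.items():
        if isinstance(value, bool):
            if value:
                args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args


command = build_autoround_command(
    "cerebras/DeepSeek-V3.2-REAP-508B-A37B",
    "./DeepSeek-V3.2-508B-NVFP4",  # hypothetical output directory
)
print(shlex.join(command))
```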

Calibration Dataset

Source	Samples	Content
NeelNanda/pile-10k	512	General web text (distribution anchor)

Dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.

Wall Time

Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
Packing + save: ~7 min (58 safetensors shards, 288 GB)
Total: ~19.7 hours on 8x H100 80GB
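
The per-block and total figures follow directly from the two reported durations, as this short check shows (it uses only the numbers listed above):

```python
from datetime import timedelta

tuning = timedelta(hours=19, minutes=38)  # quantization tuning
packing = timedelta(minutes=7)            # packing + save (approximate)
blocks = 61

per_block_min = tuning.total_seconds() / 60 / blocks
total_hours = (tuning + packing).total_seconds() / 3600

print(f"{per_block_min:.1f} min/block, {total_hours:.2f} h total")
```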

Quality Characteristics

Layer-level loss trajectory (iter 0 → final):

Layer depth	iter 0 loss	Loss reduction by final iter	Behavior
0-10	1e-6 to 1e-2	50-80% reduction	Early layers, minimal drift
11-30	1e-2 to 1e-1	30-50% reduction	Sign-tuning active
31-50	1e-1 to 5e-1	20-30% reduction	Accumulating
51-60	5e-1 to 1.9	10-20% reduction	Deep-layer drift (layer 60: 1.86 → 1.50)

Weight-validity check (CPU dequant, pre-upload):

Cosine similarity vs BF16 source: 0.995+ across all tested layers
Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)
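
To give a feel for these two metrics, here is a self-contained sketch that round-trips synthetic Gaussian weights through an NVFP4-style format and measures both. It is not the check used for this model: it applies plain round-to-nearest rather than AutoRound's tuned rounding, keeps the per-group scale in full precision instead of FP8, and defines relative MAE as sum|w - w_q| / sum|w|, which is an assumption since the card does not give the formula.

```python
import math
import random

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 (E2M1)
GROUP_SIZE = 16


def fake_quant_block(block):
    """Round one block to the E2M1 grid with a shared scale, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # kept in float here; real NVFP4 stores this as FP8
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))  # nearest grid magnitude
        out.append(math.copysign(mag * scale, x))
    return out


def roundtrip_metrics(weights):
    """Return (cosine similarity, relative MAE) between weights and their round trip."""
    deq = []
    for i in range(0, len(weights), GROUP_SIZE):
        deq.extend(fake_quant_block(weights[i:i + GROUP_SIZE]))
    dot = sum(a * b for a, b in zip(weights, deq))
    norm = math.sqrt(sum(a * a for a in weights)) * math.sqrt(sum(b * b for b in deq))
    rel_mae = sum(abs(a - b) for a, b in zip(weights, deq)) / sum(abs(a) for a in weights)
    return dot / norm, rel_mae


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(GROUP_SIZE * 256)]  # synthetic stand-in, not model weights
cosine, rel_mae = roundtrip_metrics(weights)
print(f"cosine={cosine:.4f}  relative MAE={rel_mae:.1%}")
```

Real weight distributions differ from this Gaussian stand-in, so the printed values are only indicative of the order of magnitude.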

📊 Benchmarks

Pending. Scores will be measured on 8x RTX PRO 6000 (sm120) and reported here:

Task	Score	Notes
MMLU (5-shot)	—
GSM8K	—
MATH	—
HumanEval	—
IFEval strict	—

Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):

MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
GSM8K: 78-88% (BF16 base ~80-90%)
Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4

🧾 Provenance

Step	Details
Source model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (BF16, ~1.0 TB, 96 safetensors)
Pruning	REAP (Router-weighted Expert Activation Pruning): 256 → 192 routed experts per MoE layer (shared expert and layer structure preserved)
Quantization compute	Nebius H100x8 via brev
Quant tool	Intel AutoRound 0.12.2
Deploy tool	voipmonitor/sglang:cu130
Upload date	2026-04-21

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Model tree for 0xSero/DeepSeek-V3.2-508B-NVFP4

Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
Finetuned: deepseek-ai/DeepSeek-V3.2
Quantized: cerebras/DeepSeek-V3.2-REAP-508B-A37B
Quantized (1): this model
