DSv4-Flash on RTX PRO 6000 Blackwell (SM120) — Lna-Lab Optimized Recipe

Lna-Lab production-tested optimization stack for deepseek-ai/DeepSeek-V4-Flash on NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM 12.0), built on top of @0xSero's SM120 kernel.

This repository goes beyond getting the model to run on consumer Blackwell silicon — it pushes the system into production-grade decode throughput through:

✅ EAGLE / MTP speculative decoding with accept_rate = 1.00 (perfect alignment)
✅ Increased --max-running-requests with KV-fit verification
✅ Triton MoE GEMM autotune for E=256, N=512, fp8_w8a8
✅ W8A8 Block FP8 GEMM autotune for the dense projection bottleneck (the actual hot path)
✅ Multi-step EAGLE patch for SGLang's compressed attention backend (works on the framework side; ceiling-bound by DSv4-Flash MTP head being single-layer)
✅ 6-GPU parallel autotune harness for fast turnaround on new hardware

TL;DR

# 1. Pull SGLang DSv4-Blackwell image (90 GB, one-time)
docker pull lmsysorg/sglang:deepseek-v4-blackwell

# 2. Build 0xSero kernel (one-time, ~5 min)
git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120 && bash scripts/build_in_sglang_docker.sh

# 3. Download SGL FP8 weights (274 GB, ~1.5 h on 200 MB/s link)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download sgl-project/DeepSeek-V4-Flash-FP8 \
  --local-dir /path/to/DeepSeek-V4-Flash-FP8 --max-workers 16

# 4. Drop our config JSONs into the model dir, and launch
cp configs/*.json /path/to/DeepSeek-V4-Flash-FP8/
NUM_STEPS=1 NUM_DRAFT_TOKENS=2 \
  bash scripts/launch_eagle_tuned.sh

OpenAI-compatible server on http://localhost:9000.

Performance summary (RTX PRO 6000 Blackwell × 4, TP=4)

Configuration	Single decode (tok/s)	8-conc aggregate (tok/s)	16-conc internal sustained (tok/s)	Notes
0xSero baseline (no spec, no tuning)	~5	~38	n/a	Per 0xSero docs
RECIPE baseline (`--max-running-requests 8`)	8.7	34	n/a	This repo `launch_lna_sm120.sh`
`+ --max-running-requests 16`	8.7	65	128	`launch_eagle.sh` w/o EAGLE
`+ EAGLE num-steps=1` (accept_rate 1.00)	15.5	128	245	This repo `launch_eagle.sh`
`+ MoE autotune`	15.5	118	246	~0% gain — MoE GEMM not the bottleneck
`+ MoE + W8A8 [1..256] autotune` (final)	15.15	127.87 (steady)	245-250	configs fully adopted; decode bottleneck is upstream of GEMMs (TileLang / mem BW / NCCL)

vs 0xSero (same hardware class)

Metric	0xSero	Lna-Lab	Improvement
Single decode (tok/s)	~5	15.5	3.1×
8-conc aggregate (tok/s)	~38	128	3.4×

Repository layout

.
├── README.md                  ← you are here
├── BENCHMARKS.md              detailed numbers, methodology, repro commands
├── RECIPE.md                  one-shot reproduction guide (every flag explained)
├── PATCH_multistep_eagle_dsv4.md  upstream-PR-quality writeup of multi-step EAGLE patch
├── PATH_FORWARD.md            DSv4.2 / vLLM SM120 maturation horizon (6-month outlook)
├── scripts/
│   ├── launch_lna_sm120.sh        baseline (matches RECIPE.md, max-running 8)
│   ├── launch_eagle.sh            optimized: EAGLE + max-running 16 (Lna-Lab default)
│   ├── launch_eagle_patched.sh    multi-step EAGLE (compressed backend patched)
│   ├── launch_eagle_tuned.sh      EAGLE + autotuned configs
│   └── launch_4layer_sandbox.sh   fast iteration on Pinaster 4-layer (port 9001)
├── autotune/
│   ├── run_w8a8_autotune.sh   6-shape × 6-GPU parallel W8A8 GEMM tune
│   ├── run_autotune_phase2.sh add larger MoE batches [96, 128]
│   ├── run_autotune_phase3.sh reduced-config sweep for [256, 512, 1024+]
│   └── merge_autotune_jsons.py  combine multi-phase JSONs into one config file
├── bench/
│   └── bench_dsv4.py          burst + steady-state benchmark suite
└── configs/                   pre-baked autotuned JSON configs (drop into model dir)

What this repo gets you that 0xSero alone does not

3× decode throughput at single + concurrent — by stacking EAGLE + max-running tuning on top of 0xSero's SM120 kernel.
Multi-step EAGLE works in SGLang DSv4 compressed backend — patch in PATCH_multistep_eagle_dsv4.md. It enables the path; on DSv4-Flash specifically the MTP head is single-layer so the gain is capped at accept_len ≈ 2.0. Should pay off immediately on DSv4.2 Flash when the MTP layer count grows.
6-GPU parallel autotune harness — tuning_block_wise_kernel.py and tuning_fused_moe_triton.py both run with custom batch lists × Ray distribution; full pipeline including JSON merge.
Verified W8A8 dense-GEMM bottleneck identification — MoE GEMM autotune yields ~0% on this model; it's the dense projections that dominate. run_w8a8_autotune.sh targets the right kernels.
Full transparency: failed paths documented (multi-step EAGLE → MTP capped, CUDA graph → TileLang/Inductor collision, --enable-single-batch-overlap → EAGLE incompatibility), so others don't have to repeat them.

Limitations & path forward

DSv4-Flash on SM120 has architectural ceilings independent of any framework patch:

Wall	Cause	Likely Resolution
`accept_len ≈ 2.0` ceiling	MTP head is 1 layer	DSv4.2 Flash with deeper MTP
No CUDA graph	TileLang JIT vs `torch._dynamo`	DSv4 native CUDA path in vLLM (Q3 2026?)
TP=4 hard floor	KV/activation patterns + lack of FP8 native CUTLASS for SM120	TP=2 viable when vLLM matures (H2 2026)

See PATH_FORWARD.md for the full 6-month outlook and which assets in this repo retain value through the maturation cycle.

Hardware

4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GiB / GPU, SM 12.0)
Driver/CUDA: see SGLang image base (CUDA 12.9.1)
Optional 5th–6th GPU used by the autotune harness (any SM ≥ 12.0)

Acknowledgements

@0xSero — the SM120 kernel that makes any of this possible
SGLang team — the DSv4 Blackwell image and compressed attention backend
DeepSeek AI — DeepSeek-V4-Flash itself
sgl-project/DeepSeek-V4-Flash-FP8 — SGLang-loadable FP8 weights
Pinaster/DeepSeek-V4-Flash-FP8-4layer — sandbox model for fast iteration

Citation

Lna-Lab. (2026). DSv4-Flash on SM120: Optimized SGLang Recipe.
GitHub: Shinka-Man/dsv4-flash-sm120-optimized
HF configs: sakamakismile/DSv4-Flash-FP8-SM120-Configs

License

MIT.

— Lna-Lab, 2026-04-25

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support