YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
DSv4-Flash on RTX PRO 6000 Blackwell (SM120) β Lna-Lab Optimized Recipe
Lna-Lab production-tested optimization stack for
deepseek-ai/DeepSeek-V4-Flashon NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM 12.0), built on top of @0xSero's SM120 kernel.
This repository goes beyond getting the model to run on consumer Blackwell silicon β it pushes the system into production-grade decode throughput through:
- β
EAGLE / MTP speculative decoding with
accept_rate = 1.00(perfect alignment) - β
Increased
--max-running-requestswith KV-fit verification - β
Triton MoE GEMM autotune for
E=256, N=512, fp8_w8a8 - β W8A8 Block FP8 GEMM autotune for the dense projection bottleneck (the actual hot path)
- β
Multi-step EAGLE patch for SGLang's
compressedattention backend (works on the framework side; ceiling-bound by DSv4-Flash MTP head being single-layer) - β 6-GPU parallel autotune harness for fast turnaround on new hardware
TL;DR
# 1. Pull SGLang DSv4-Blackwell image (90 GB, one-time)
docker pull lmsysorg/sglang:deepseek-v4-blackwell
# 2. Build 0xSero kernel (one-time, ~5 min)
git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120 && bash scripts/build_in_sglang_docker.sh
# 3. Download SGL FP8 weights (274 GB, ~1.5 h on 200 MB/s link)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download sgl-project/DeepSeek-V4-Flash-FP8 \
--local-dir /path/to/DeepSeek-V4-Flash-FP8 --max-workers 16
# 4. Drop our config JSONs into the model dir, and launch
cp configs/*.json /path/to/DeepSeek-V4-Flash-FP8/
NUM_STEPS=1 NUM_DRAFT_TOKENS=2 \
bash scripts/launch_eagle_tuned.sh
OpenAI-compatible server on http://localhost:9000.
Performance summary (RTX PRO 6000 Blackwell Γ 4, TP=4)
| Configuration | Single decode (tok/s) | 8-conc aggregate (tok/s) | 16-conc internal sustained (tok/s) | Notes |
|---|---|---|---|---|
| 0xSero baseline (no spec, no tuning) | ~5 | ~38 | n/a | Per 0xSero docs |
RECIPE baseline (--max-running-requests 8) |
8.7 | 34 | n/a | This repo launch_lna_sm120.sh |
+ --max-running-requests 16 |
8.7 | 65 | 128 | launch_eagle.sh w/o EAGLE |
+ EAGLE num-steps=1 (accept_rate 1.00) |
15.5 | 128 | 245 | This repo launch_eagle.sh |
+ MoE autotune |
15.5 | 118 | 246 | ~0% gain β MoE GEMM not the bottleneck |
+ MoE + W8A8 [1..256] autotune (final) |
15.15 | 127.87 (steady) | 245-250 | configs fully adopted; decode bottleneck is upstream of GEMMs (TileLang / mem BW / NCCL) |
vs 0xSero (same hardware class)
| Metric | 0xSero | Lna-Lab | Improvement |
|---|---|---|---|
| Single decode (tok/s) | ~5 | 15.5 | 3.1Γ |
| 8-conc aggregate (tok/s) | ~38 | 128 | 3.4Γ |
Repository layout
.
βββ README.md β you are here
βββ BENCHMARKS.md detailed numbers, methodology, repro commands
βββ RECIPE.md one-shot reproduction guide (every flag explained)
βββ PATCH_multistep_eagle_dsv4.md upstream-PR-quality writeup of multi-step EAGLE patch
βββ PATH_FORWARD.md DSv4.2 / vLLM SM120 maturation horizon (6-month outlook)
βββ scripts/
β βββ launch_lna_sm120.sh baseline (matches RECIPE.md, max-running 8)
β βββ launch_eagle.sh optimized: EAGLE + max-running 16 (Lna-Lab default)
β βββ launch_eagle_patched.sh multi-step EAGLE (compressed backend patched)
β βββ launch_eagle_tuned.sh EAGLE + autotuned configs
β βββ launch_4layer_sandbox.sh fast iteration on Pinaster 4-layer (port 9001)
βββ autotune/
β βββ run_w8a8_autotune.sh 6-shape Γ 6-GPU parallel W8A8 GEMM tune
β βββ run_autotune_phase2.sh add larger MoE batches [96, 128]
β βββ run_autotune_phase3.sh reduced-config sweep for [256, 512, 1024+]
β βββ merge_autotune_jsons.py combine multi-phase JSONs into one config file
βββ bench/
β βββ bench_dsv4.py burst + steady-state benchmark suite
βββ configs/ pre-baked autotuned JSON configs (drop into model dir)
What this repo gets you that 0xSero alone does not
- 3Γ decode throughput at single + concurrent β by stacking EAGLE + max-running tuning on top of 0xSero's SM120 kernel.
- Multi-step EAGLE works in SGLang DSv4 compressed backend β patch in
PATCH_multistep_eagle_dsv4.md. It enables the path; on DSv4-Flash specifically the MTP head is single-layer so the gain is capped ataccept_len β 2.0. Should pay off immediately on DSv4.2 Flash when the MTP layer count grows. - 6-GPU parallel autotune harness β
tuning_block_wise_kernel.pyandtuning_fused_moe_triton.pyboth run with custom batch lists Γ Ray distribution; full pipeline including JSON merge. - Verified W8A8 dense-GEMM bottleneck identification β MoE GEMM autotune yields ~0% on this model; it's the dense projections that dominate.
run_w8a8_autotune.shtargets the right kernels. - Full transparency: failed paths documented (multi-step EAGLE β MTP capped, CUDA graph β TileLang/Inductor collision,
--enable-single-batch-overlapβ EAGLE incompatibility), so others don't have to repeat them.
Limitations & path forward
DSv4-Flash on SM120 has architectural ceilings independent of any framework patch:
| Wall | Cause | Likely Resolution |
|---|---|---|
accept_len β 2.0 ceiling |
MTP head is 1 layer | DSv4.2 Flash with deeper MTP |
| No CUDA graph | TileLang JIT vs torch._dynamo |
DSv4 native CUDA path in vLLM (Q3 2026?) |
| TP=4 hard floor | KV/activation patterns + lack of FP8 native CUTLASS for SM120 | TP=2 viable when vLLM matures (H2 2026) |
See PATH_FORWARD.md for the full 6-month outlook and which assets in this repo retain value through the maturation cycle.
Hardware
- 4Γ NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GiB / GPU, SM 12.0)
- Driver/CUDA: see SGLang image base (CUDA 12.9.1)
- Optional 5thβ6th GPU used by the autotune harness (any SM β₯ 12.0)
Acknowledgements
- @0xSero β the SM120 kernel that makes any of this possible
- SGLang team β the DSv4 Blackwell image and compressed attention backend
- DeepSeek AI β DeepSeek-V4-Flash itself
sgl-project/DeepSeek-V4-Flash-FP8β SGLang-loadable FP8 weightsPinaster/DeepSeek-V4-Flash-FP8-4layerβ sandbox model for fast iteration
Citation
Lna-Lab. (2026). DSv4-Flash on SM120: Optimized SGLang Recipe.
GitHub: Shinka-Man/dsv4-flash-sm120-optimized
HF configs: sakamakismile/DSv4-Flash-FP8-SM120-Configs
License
MIT.
β Lna-Lab, 2026-04-25
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support