YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

DSv4-Flash on RTX PRO 6000 Blackwell (SM120) β€” Lna-Lab Optimized Recipe

Lna-Lab production-tested optimization stack for deepseek-ai/DeepSeek-V4-Flash on NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM 12.0), built on top of @0xSero's SM120 kernel.

This repository goes beyond getting the model to run on consumer Blackwell silicon β€” it pushes the system into production-grade decode throughput through:

  • βœ… EAGLE / MTP speculative decoding with accept_rate = 1.00 (perfect alignment)
  • βœ… Increased --max-running-requests with KV-fit verification
  • βœ… Triton MoE GEMM autotune for E=256, N=512, fp8_w8a8
  • βœ… W8A8 Block FP8 GEMM autotune for the dense projection bottleneck (the actual hot path)
  • βœ… Multi-step EAGLE patch for SGLang's compressed attention backend (works on the framework side; ceiling-bound by DSv4-Flash MTP head being single-layer)
  • βœ… 6-GPU parallel autotune harness for fast turnaround on new hardware

TL;DR

# 1. Pull SGLang DSv4-Blackwell image (90 GB, one-time)
docker pull lmsysorg/sglang:deepseek-v4-blackwell

# 2. Build 0xSero kernel (one-time, ~5 min)
git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120 && bash scripts/build_in_sglang_docker.sh

# 3. Download SGL FP8 weights (274 GB, ~1.5 h on 200 MB/s link)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download sgl-project/DeepSeek-V4-Flash-FP8 \
  --local-dir /path/to/DeepSeek-V4-Flash-FP8 --max-workers 16

# 4. Drop our config JSONs into the model dir, and launch
cp configs/*.json /path/to/DeepSeek-V4-Flash-FP8/
NUM_STEPS=1 NUM_DRAFT_TOKENS=2 \
  bash scripts/launch_eagle_tuned.sh

OpenAI-compatible server on http://localhost:9000.


Performance summary (RTX PRO 6000 Blackwell Γ— 4, TP=4)

Configuration Single decode (tok/s) 8-conc aggregate (tok/s) 16-conc internal sustained (tok/s) Notes
0xSero baseline (no spec, no tuning) ~5 ~38 n/a Per 0xSero docs
RECIPE baseline (--max-running-requests 8) 8.7 34 n/a This repo launch_lna_sm120.sh
+ --max-running-requests 16 8.7 65 128 launch_eagle.sh w/o EAGLE
+ EAGLE num-steps=1 (accept_rate 1.00) 15.5 128 245 This repo launch_eagle.sh
+ MoE autotune 15.5 118 246 ~0% gain β€” MoE GEMM not the bottleneck
+ MoE + W8A8 [1..256] autotune (final) 15.15 127.87 (steady) 245-250 configs fully adopted; decode bottleneck is upstream of GEMMs (TileLang / mem BW / NCCL)

vs 0xSero (same hardware class)

Metric 0xSero Lna-Lab Improvement
Single decode (tok/s) ~5 15.5 3.1Γ—
8-conc aggregate (tok/s) ~38 128 3.4Γ—

Repository layout

.
β”œβ”€β”€ README.md                  ← you are here
β”œβ”€β”€ BENCHMARKS.md              detailed numbers, methodology, repro commands
β”œβ”€β”€ RECIPE.md                  one-shot reproduction guide (every flag explained)
β”œβ”€β”€ PATCH_multistep_eagle_dsv4.md  upstream-PR-quality writeup of multi-step EAGLE patch
β”œβ”€β”€ PATH_FORWARD.md            DSv4.2 / vLLM SM120 maturation horizon (6-month outlook)
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ launch_lna_sm120.sh        baseline (matches RECIPE.md, max-running 8)
β”‚   β”œβ”€β”€ launch_eagle.sh            optimized: EAGLE + max-running 16 (Lna-Lab default)
β”‚   β”œβ”€β”€ launch_eagle_patched.sh    multi-step EAGLE (compressed backend patched)
β”‚   β”œβ”€β”€ launch_eagle_tuned.sh      EAGLE + autotuned configs
β”‚   └── launch_4layer_sandbox.sh   fast iteration on Pinaster 4-layer (port 9001)
β”œβ”€β”€ autotune/
β”‚   β”œβ”€β”€ run_w8a8_autotune.sh   6-shape Γ— 6-GPU parallel W8A8 GEMM tune
β”‚   β”œβ”€β”€ run_autotune_phase2.sh add larger MoE batches [96, 128]
β”‚   β”œβ”€β”€ run_autotune_phase3.sh reduced-config sweep for [256, 512, 1024+]
β”‚   └── merge_autotune_jsons.py  combine multi-phase JSONs into one config file
β”œβ”€β”€ bench/
β”‚   └── bench_dsv4.py          burst + steady-state benchmark suite
└── configs/                   pre-baked autotuned JSON configs (drop into model dir)

What this repo gets you that 0xSero alone does not

  1. 3Γ— decode throughput at single + concurrent β€” by stacking EAGLE + max-running tuning on top of 0xSero's SM120 kernel.
  2. Multi-step EAGLE works in SGLang DSv4 compressed backend β€” patch in PATCH_multistep_eagle_dsv4.md. It enables the path; on DSv4-Flash specifically the MTP head is single-layer so the gain is capped at accept_len β‰ˆ 2.0. Should pay off immediately on DSv4.2 Flash when the MTP layer count grows.
  3. 6-GPU parallel autotune harness β€” tuning_block_wise_kernel.py and tuning_fused_moe_triton.py both run with custom batch lists Γ— Ray distribution; full pipeline including JSON merge.
  4. Verified W8A8 dense-GEMM bottleneck identification β€” MoE GEMM autotune yields ~0% on this model; it's the dense projections that dominate. run_w8a8_autotune.sh targets the right kernels.
  5. Full transparency: failed paths documented (multi-step EAGLE β†’ MTP capped, CUDA graph β†’ TileLang/Inductor collision, --enable-single-batch-overlap β†’ EAGLE incompatibility), so others don't have to repeat them.

Limitations & path forward

DSv4-Flash on SM120 has architectural ceilings independent of any framework patch:

Wall Cause Likely Resolution
accept_len β‰ˆ 2.0 ceiling MTP head is 1 layer DSv4.2 Flash with deeper MTP
No CUDA graph TileLang JIT vs torch._dynamo DSv4 native CUDA path in vLLM (Q3 2026?)
TP=4 hard floor KV/activation patterns + lack of FP8 native CUTLASS for SM120 TP=2 viable when vLLM matures (H2 2026)

See PATH_FORWARD.md for the full 6-month outlook and which assets in this repo retain value through the maturation cycle.


Hardware

  • 4Γ— NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GiB / GPU, SM 12.0)
  • Driver/CUDA: see SGLang image base (CUDA 12.9.1)
  • Optional 5th–6th GPU used by the autotune harness (any SM β‰₯ 12.0)

Acknowledgements

Citation

Lna-Lab. (2026). DSv4-Flash on SM120: Optimized SGLang Recipe.
GitHub: Shinka-Man/dsv4-flash-sm120-optimized
HF configs: sakamakismile/DSv4-Flash-FP8-SM120-Configs

License

MIT.

β€” Lna-Lab, 2026-04-25

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support