How to run on 4x RTX PRO 6000 Blackwell 96GB (SM120)

#1
by aaron-newsome - opened

Would appreciate any known good configs.

You can base on this: https://github.com/spark-arena/recipe-registry/blob/main/experimental-recipes/minimax-m3/minimax-m3-v0-nvfp4-4x.yaml

sparkrun doesn't (yet) support non-DGX Sparks, although that's coming in the next release.

The revised kernels and work were all focused on DGX Spark (SM121); however, it might work on RTX Pro 6000 (since SM120 and SM121 are essentially the same); I don't recall though if I compiled in explicit SM120 support or not.

But you can try via docker using the image in the sparkrun recipe and then follow the config in the recipe. I guess report back if it works, since that would be good to know!

Note that the sglang version in the container image went through a bunch of patching to get full support & faster loading; we'll try to contribute back anything upstream that is novel/new, but current priority was release.

Couple of things:

What did v0 calibrate on (ModelOpt? dataset / sample count)? The quantization-sensitive parts of M3 are the MSA indexer and the MoE router: noise in indexer scores changes which KV blocks survive top-k, so a lightly calibrated quant tends to show up as long-context retrieval/coherence degradation rather than anything perplexity catches. Worth a v1 with a larger, more diverse calibration set (indexer/router/embeddings kept in BF16), unless that's what has already been done.
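To make the top-k point concrete, here is a toy sketch (not the actual MSA indexer). The scores, block count, noise model, and function names are all made up for illustration. It only shows that once noise is comparable to the gap between scores near the cutoff, the set of surviving blocks changes.

```python
import random

def top_k_indices(scores, k):
    """Return the set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def top_k_overlap(scores, noise_scale, k, seed=0):
    """Fraction of the clean top-k set that survives after adding uniform noise."""
    rng = random.Random(seed)
    noisy = [s + rng.uniform(-noise_scale, noise_scale) for s in scores]
    clean = top_k_indices(scores, k)
    return len(clean & top_k_indices(noisy, k)) / k

# Hypothetical indexer scores for 64 KV blocks, evenly spaced 0.01 apart.
scores = [1.0 - 0.01 * i for i in range(64)]

# Noise stands in for quantization error in the scores (an assumption).
for noise in (0.0, 0.05, 0.5):
    overlap = top_k_overlap(scores, noise, k=16)
    print(f"noise={noise}: top-16 overlap with clean selection = {overlap:.2f}")
```

A perplexity check averages over all tokens, so a few swapped blocks barely move it. A long-context retrieval test fails outright if the block holding the answer is one of the ones dropped.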

On these being the same: the silicon is more or less identical, but binary compatibility is one-directional. sm_120 cubins run on sm_121, sm_121 cubins won't load on sm_120, and the block-scaled FP4 MMA paths are a-suffix (sm_121a), which doesn't cross archs in either direction. So if the image was built 121-only, it'll hard-fail on a Pro 6000 with "no kernel image is available" rather than degrade nicely. The fix is cheap since the SM really is the same: same source, add something like `-gencode arch=compute_120a,code=sm_120a` (plus plain 120/120f for the non-FP4 kernels) and ship a fat binary.
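The compatibility rules above can be written down as a small sketch. This models only the two rules stated in the comment: plain targets load on the same major version with an equal or higher minor, and `a`-suffix targets need an exact match. It does not model PTX JIT or the family (`f`) targets, and the helper names are hypothetical.

```python
import re

def parse_sm(target):
    """Split a target such as 'sm_120' or 'sm_121a' into (major, minor, suffix)."""
    m = re.fullmatch(r"sm_(\d+)(\d)([af]?)", target)
    if m is None:
        raise ValueError(f"unrecognised target: {target!r}")
    return int(m.group(1)), int(m.group(2)), m.group(3)

def cubin_loads_on(cubin_target, device_target):
    """True if a cubin built for cubin_target can load on device_target."""
    c_major, c_minor, c_suffix = parse_sm(cubin_target)
    d_major, d_minor, _ = parse_sm(device_target)
    if c_suffix == "a":
        # Arch-specific code only runs on the exact compute capability.
        return (c_major, c_minor) == (d_major, d_minor)
    if c_suffix == "f":
        raise NotImplementedError("family-specific targets are not modelled here")
    # Plain cubins are forward compatible within the same major version.
    return c_major == d_major and c_minor <= d_minor

def gencode_flags(targets):
    """Build nvcc -gencode flags for a fat binary covering the given sm targets."""
    return [f"-gencode arch={t.replace('sm_', 'compute_')},code={t}" for t in targets]

# A 121-only build versus a fat binary, checked against an RTX Pro 6000 (sm_120).
spark_only = ["sm_121", "sm_121a"]
fat = ["sm_120", "sm_120a", "sm_121a"]
for build in (spark_only, fat):
    usable = [t for t in build if cubin_loads_on(t, "sm_120")]
    print(build, "-> loads on sm_120:", usable)

print(" ".join(gencode_flags(fat)))
```

With the 121-only list nothing loads on sm_120, which is the hard-fail case. The fat list covers both devices: sm_120 cubins for the plain kernels on either card, plus one arch-specific cubin per device for the FP4 paths.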

FWIW, b12x should have an MSA implementation with SM120 support landing soon. Meanwhile, I've got 4× RTX PRO 6000. I'm doing an NVFP4 quant of M3 myself with Luke Alonzo's calibration recipe (which is awesome and scores well on KLD for the past NVFP4 quants he's made).

https://discord.gg/BCBUY83EM That's a great little server specifically for RTX Pro 6Ks. If you've seen local inference lab on GitHub (VoipMonitor is the repo man there), this is where you'll find conversation that's all RTX Pro 6K related. Lots of great stuff there.

Spark Arena org

@brandonmusic The experts are quantized but other layers (e.g. router) were left alone. Calibration was done through a short bespoke training set designed around the MiniMax M3 validation cases. (MiniMax open sourced short provider validation tests.) Will likely update the v0 later with improved calibration as we discover gaps. Idea of v0 was to produce a reasonably high quality fast output for people to get their hands on.

Initial builds were built with combination SM120 and SM121 in order to support our RTX Pro 6000 brethren, but a thought occurred to me that's sort of a big deal -- currently only providing an ARM64 build in that image, so that's likely to be less helpful to most people with RTX Pro 6000s. I can potentially add updated container with x86_64/AMD64 build for SM120 users. Will be working to submit (or augment existing PRs) so that we shed the need for a bespoke inference container. (The container is a bundle of PRs because it also e.g. uses instanttensor for loading because otherwise loading MiniMax M3 takes forever...).

minimax-m3-v0-nvfp4-4x.yaml definitely does not work on an Intel system since it's an ARM container, and yes, I wasted a bunch of bandwidth and time downloading it to find that out.
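A quick pre-flight check can avoid that wasted download. This is a minimal sketch: the helper names are hypothetical, and it assumes you already have the list of platforms an image is published for (for example from `docker manifest inspect <image>`).

```python
import platform

# Map common platform.machine() values to Docker platform strings.
_DOCKER_PLATFORMS = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}

def docker_platform(machine=None):
    """Return the Docker platform string for this host (or a given machine name)."""
    name = (machine or platform.machine()).lower()
    try:
        return _DOCKER_PLATFORMS[name]
    except KeyError:
        raise ValueError(f"unknown machine type: {name!r}") from None

def image_supports_host(image_platforms, machine=None):
    """True if the host platform appears in the image's published platform list."""
    return docker_platform(machine) in image_platforms

# An image published only for arm64, checked against an Intel (x86_64) host.
print(image_supports_host(["linux/arm64"], machine="x86_64"))

# A multi-arch image, checked against the machine this script runs on.
print(image_supports_host(["linux/amd64", "linux/arm64"]))
```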

i'm running MiniMax M2.7 with voipmonitor/sglang:cu130 with the b12x backend. I'm guessing this doesn't work with M3.

I guess I'll be patiently waiting for a known good config for SM120 on Intel. I do appreciate the releases you post here, even if I can't use them.

Thank you.

Luke, I think, is making an NVFP4 quant, and I'm exporting one to my HF right now. b12x will get support for M3; I'm trying to get that up and running as well. I have high hopes for this model. It's my first attempt at this, but it'll obviously be totally open source in case it might be of some help to try out.

Spark Arena org

I'm also exporting updated container with x86_64 build, so hopefully soon enough there will be lots of options. I also have high hopes for MiniMax M3. It's worth all of the effort that we're all putting in! :-)

@aaron-newsome soon you'll have too many options!

Just uploaded my version https://huggingface.co/brandonmusic/MiniMax-M3-NVFP4 for SM120. Downloading it to my local server to get a Docker image up and running, to check that it doesn't produce NaN or have any issues from the quanting and calibration pass.
Kept the calibration data as part of it, @dbotwinick, in case it would be helpful for what you're working on!

Spark Arena org

Thanks. Didn't know about https://github.com/local-inference-lab/quant-toolkit; it's actually very similar to what I did.

Spark Arena org

@aaron-newsome The sglang v0 container image now supports x86_64 and arm64, so you should be able to use it. I also made sure that we compiled for both sm_120a (RTX Pro 6000) and sm_121a (DGX Spark).

Thank you for the update @dbotwinick ! I will give it a whirl right now!

@dbotwinick I wanted to give you an update and let you know the container came right up, without a lot of futzing around. I ran a few million tokens through it and not a single hiccup or crash. I can't thank you enough. Model released on Friday and before the end of Sunday I'm running, with no issue yet. I really appreciate it.

Spark Arena org

@aaron-newsome Awesome. Glad to hear it!!! Sorry it took so long ;)

i don't mean to go full fanboy here but this model you've packed up into NVFP4 is truly unbelievable. i'm not one to spout ridiculous hyperbole, trust me. but this model feels like Opus 4.7 running at home. it's not just the model intelligence though. it's the whole package top to bottom. the sglang speed and stability, the m3 kv cache magic and msa, the drastically reduced tool call failures, all of it contributes to a really amazing experience. in the 24 hours I've had this thing running, all 4 GPUs have been fully utilized without a break. I've closed DOZENS of issue tickets in the app I'm working on. all without a SINGLE crash! MiniMax literally takes a ticket, sets the status to in progress, researches an issue, does the fixes, does all the verifications, updates the ticket, does all the git stuff and moves on to the next. if I could get pi to continue after compaction, this thing might run a week straight just knocking out tickets. i honestly thought we'd never see this level of intelligence in an open weight model in < 250GB of VRAM. just amazing. i give it 10 stars.

Spark Arena org

@aaron-newsome It is a great model. Credit there to the MiniMax team. I also worked really hard to make sure the calibration was as faithful as possible to the official model.

And honestly I worked REALLY hard to get it out quickly and STABLE, so glad you're a fan! I seriously appreciate the feedback since it validates the effort that at least someone is getting good use out of it!

Apparently likes & follows are important, so do me a favor and share some likes/stars/whatever on github, HF, twitter, etc. for Spark Arena and for this model repo.

I've liked, followed and starred everything possible. I'll give you guys a shout out on my youtube channel once it's up and running.

I can't believe I'm the only one who has commented on this fantastic model release. I've been using local models since the very early days of llama models running on my P40 GPUs (which I still use). I'm absolutely BLOWN AWAY by how stable this thing is. I've been running it 24x7 for OVER A WEEK and not a single crash or any weirdness. I have not observed a SINGLE FAILED TOOL CALL. First of all, how is that even possible? I've used every prior MiniMax model released, GLM models, various Qwen models, Deepseek v4, oddballs like Mimo and Stepfun, and one thing ties them ALL together: random annoying failures of all kinds, including tool calls, which really slow the process.

As of today, 8 days in I've clocked 42.41B tokens in and 253.42M tokens out. There have only been a few hours over the 8 days where the model is not generating tokens. Consider it 8 days straight.
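For a sense of scale, those totals can be turned into average rates. A rough sketch, assuming the full 8 days count as continuous uptime (as the post suggests) and using only the two totals quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_rate(total_tokens, days):
    """Average tokens per second over a wall-clock period given in days."""
    return total_tokens / (days * SECONDS_PER_DAY)

tokens_in = 42.41e9    # input tokens, from the post
tokens_out = 253.42e6  # output tokens, from the post
days = 8               # treated as continuous uptime (assumption)

print(f"input:  {average_rate(tokens_in, days):,.0f} tokens/s")
print(f"output: {average_rate(tokens_out, days):,.1f} tokens/s")
print(f"in/out ratio: {tokens_in / tokens_out:,.0f}:1")
```

That works out to roughly 61 thousand input tokens and about 367 output tokens per second on average, with around 167 input tokens for every output token.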

Once again, BRAVO to the MiniMax team and the Spark Arena guys for dropping an absolute MONSTER of a model.


What’s your use case? If you don’t mind sharing!

About 95% of the tokens are generated in the pi coding agent. I've used pretty much every coding agent, including Claude Code, but I find pi best for my workflow. Pi is currently building, rebuilding and refining a handful of applications. The workflow is: I enter the tickets into the version control system, and a loop takes open tickets and launches them into pi for planning and inspection. I then guide the completion of the ticket interactively in pi. The loop makes sure there's no idle time. The strategy is a bit different for work designated to run overnight, when I'm not available. Projects are mostly in Go, Rust, Python and TypeScript, but there are some oddballs in there too.

The other 5% of tokens comes from a long list of AI enabled apps running locally.

  • Generating image prompts at scale (around 100 per day)
  • Image analysis (a few hundred per day)
  • PaperlessNG AI (document classification, etc)
  • Perplexica (local Perplexity clone)
  • OpenWebUI (basic web based chat)
  • Custom RAG app (a few dozen queries per day)
  • News aggregation, summarization
  • A bunch of other self-hosted AI enabled apps

The 95/5 split isn't scientific; it's based on vibes. I need to improve my proxy dashboard to better break down where LLM requests are coming from.
