unsloth for the non-gguf crowd?
@JoeSmith245 try llama.cpp, works very nice with A100s as well!
Thanks. I do use llama.cpp sometimes; it's good for running awkward quant sizes for models that barely fit in VRAM. But usually I much prefer vLLM or SGLang: they're much more performant. llama.cpp sometimes barely hits 20% utilization of my GPUs, whereas vLLM can max them all out at 100% for the highest possible throughput.
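One way to check a utilization claim like this on your own hardware is to poll `nvidia-smi` while each server is under load. The sketch below is only illustrative: it assumes an NVIDIA driver with `nvidia-smi` on the PATH, and the helper names (`parse_utilization`, `sample_gpu_utilization`, `mean_utilization`) are made up for this example.

```python
import subprocess


def parse_utilization(csv_text: str) -> list[int]:
    """Parse nvidia-smi output that has one utilization percentage per line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_gpu_utilization() -> list[int]:
    """Return the current utilization (%) of every visible GPU.

    Assumes nvidia-smi is installed; raises if the command fails.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_utilization(result.stdout)


def mean_utilization(samples: list[list[int]]) -> float:
    """Average utilization over all GPUs and all collected samples."""
    flat = [value for sample in samples for value in sample]
    return sum(flat) / len(flat) if flat else 0.0
```

Collecting a few dozen samples from `sample_gpu_utilization()` during a benchmark run and passing them to `mean_utilization` gives a rough figure to compare between backends. Note that utilization is a coarse signal, so tokens per second is still the number that matters for throughput.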
Not nemotron or AWQ, but these might be interesting:
- Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
- Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
...
Run very nicely on Amperes + vLLM, and the models are definitely up there.
> Not nemotron or AWQ
I gave up on Nemotron 3 after some testing. Nemotron is still very interesting to me, as a "strong" open source model from a trusted company. But it's "not quite there" as a SOTA OSS model, imho. Maybe with 3.5 or 4, especially if they add vision or omni ;)
> but these might be interesting:
> - Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
> - Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
> Run very nicely on Amperes + vLLM, and the models are definitely up there.
If you can run Qwen 397B, I would very strongly encourage you to try MiniMax M2.5 at around Q4 (or whatever fits), or (if you enjoy "smart but verbose" reasoning) Step-3.5-Flash. I'm really torn between these two as the best that will run on my hardware at the moment: both are excellent with OpenCode. If you have a ton of VRAM, say enough to run 400B models at Q8 or more, you could try Kimi at a very low quant.
I can't run ~400B models (in GPU). I have 4x 3090 (for 96GB) + 1x 3070 (another 8GB, which I normally use for other things like embedding/TTS/ASR/OCR). But in 96GB I can run MiniMax M2.5 at Q2_XSS, and Qwen3.5-122B at higher quants is a bad joke by comparison. Only Step-3.5-Flash has been in the same league.
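The "what fits in 96GB" reasoning can be approximated with back-of-the-envelope arithmetic. This is a rough sketch, not a sizing tool: it counts weights only, treats the nominal bit width as the real bits per weight (actual quant formats carry extra scale metadata), and the 8 GB headroom for KV cache and activations is an arbitrary assumption. Parameter counts are taken from the model names quoted above.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weight_memory_gb(params_billions, bits_per_weight) + headroom_gb <= vram_gb


VRAM_GB = 4 * 24  # four 24 GB RTX 3090s, as described in the post

for name, params in [("Qwen3.5-397B", 397), ("Qwen3.5-122B", 122)]:
    size = weight_memory_gb(params, 4)
    print(f"{name} at 4 bits: ~{size:.1f} GB, fits in {VRAM_GB} GB: {fits(params, 4, VRAM_GB)}")
```

Under these assumptions a 397B model at 4 bits needs roughly 198.5 GB for weights alone, which is why it is out of reach on 96GB, while a 122B model at 4 bits (about 61 GB) fits with room to spare.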
I honestly don't get the fuss about Qwen. I used it for a short time just prior to QwQ, and I've tried every Qwen reasoning model since QwQ; they've ALL had serious quality issues: looping (well known), but also "pseudo-reasoning", i.e. a lot of "fake" reasoning phrases that sound like thought but aren't actually coherent or relevant in the situation, so they just waste tokens and confuse the outcome. For example, "ah-hah! it's X", when it's clearly not X at all. Sadly this doesn't get benchmarked much. OckBench exists, but doesn't cover many models. By one analysis of its reasoning, Qwen 2.5 scored 2.3% "useful" reasoning (the rest being wasted or even misleading tokens), according to Liquid AI: https://www.linkedin.com/posts/maxime-labonne_most-reasoning-steps-in-llms-are-just-activity-7437484983092588545-8L31. Step-3.5-Flash is also guilty of this sometimes, but not as much. MiniMax is much more coherent, even at Q2, and is much less wasteful of CoT tokens in general.
The one thing that Qwen offers is good, accurate vision support, but I'm working on running that at Q4 on my 8GB card with reasoning disabled, for vision alone, and routing image-to-text requests to it through a proxy, with everything else going to MiniMax/Step.
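The routing decision in such a proxy could look roughly like the sketch below. It assumes requests arrive in the OpenAI-compatible chat format, where an image appears as a content part with `"type": "image_url"`. The two backend URLs and the function names are hypothetical placeholders, not the poster's actual setup.

```python
from typing import Any

# Hypothetical local endpoints: a small vision model on the 8GB card,
# and the main text model on the 4x 3090 box.
VISION_BACKEND = "http://localhost:8001/v1"
TEXT_BACKEND = "http://localhost:8000/v1"


def has_image(messages: list[dict[str, Any]]) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multimodal messages use a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_backend(request_body: dict[str, Any]) -> str:
    """Choose the vision backend for image requests, the text backend otherwise."""
    if has_image(request_body.get("messages", [])):
        return VISION_BACKEND
    return TEXT_BACKEND
```

A real proxy would then forward the request body to the chosen base URL. A two-stage variant (caption the image with the vision model, then hand the text to the main model) would need an extra step that this sketch does not cover.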
BTW: both MiniMax M2.7 and Step-3.6-Flash are due soon (in a few weeks). I'd encourage you to take a look at those when they finally drop, too. Also, MiniMax M3 should be multimodal.