
unsloth for the non-gguf crowd?

by deleted - opened Mar 11

Discussion

deleted

Mar 11

It would be great if Unsloth could start providing AWQ 8-bit/4-bit and AutoRound quants, for those of us who want to run vLLM but are on Ampere and similar GPUs :)

fsaudm

Mar 20

@JoeSmith245 try llama.cpp, works very nice with A100s as well!

deleted

Mar 21

> @JoeSmith245 try llama.cpp, works very nice with A100s as well!

Thanks. I do use llama.cpp sometimes; it's good for running awkward quant sizes for models that barely fit in VRAM. But usually I much prefer vLLM or SGLang, which are much more performant. llama.cpp sometimes barely hits 20% utilization of my GPUs, whereas vLLM can max them all out at 100% for the highest possible throughput.

fsaudm

Mar 24

edited Mar 24

> > @JoeSmith245 try llama.cpp, works very nice with A100s as well!
>
> Thanks. I do use llama.cpp sometimes; it's good for running awkward quant sizes for models that barely fit in VRAM. But usually I much prefer vLLM or SGLang, which are much more performant. llama.cpp sometimes barely hits 20% utilization of my GPUs, whereas vLLM can max them all out at 100% for the highest possible throughput.

Not nemotron or AWQ, but these might be interesting:

Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
...

Run very nicely on Amperes + vLLM, and the models are definitely up there.

deleted

Mar 24

edited Mar 24

> @JoeSmith245 try llama.cpp, works very nice with A100s as well!
>
> Not nemotron or AWQ

I gave up on Nemotron 3 after some testing. Nemotron is still very interesting to me as a "strong" open-source model from a trusted company, but it's "not quite there" as a SOTA OSS model, IMHO. Maybe with 3.5 or 4, especially if they add vision or omni ;)

> but these might be interesting:
>
> Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
>
> Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
>
> Run very nicely on Amperes + vLLM, and the models are definitely up there.

If you can run Qwen 397B, I would very strongly encourage you to try MiniMax M2.5 at around Q4 (or whatever fits), or (if you enjoy "smart but verbose" reasoning) Step-3.5-Flash. I'm really torn between these two as the best that will run on my hardware at the moment: both are excellent with OpenCode. If you have a ton of VRAM (say, you're running 400B models at Q8 or more), you could try Kimi at a very low quant.

I can't run ~400B models (in GPU). I have 4x 3090 (96GB total) + 1x 3070 (another 8GB, which I normally use for other things like embedding/TTS/ASR/OCR). But in 96GB I can run MiniMax M2.5 at Q2_XSS, and Qwen3.5-122B at higher quants is a bad joke by comparison. Only Step-3.5-Flash has been in the same league.

I honestly don't get the fuss about Qwen. I used it for a short time just prior to QwQ, and I've tried every Qwen reasoning model since QwQ. They've ALL had serious quality issues: looping (well known), but also "pseudo-reasoning", meaning a lot of "fake" reasoning phrases that sound like thought but aren't actually coherent or relevant in the situation, so they just waste tokens and confuse the outcome. For example, "Aha! It's X", when it's clearly not X at all. Sadly this doesn't get benchmarked much. OckBench exists, but doesn't cover many models. By one analysis of its reasoning, Qwen 2.5 scored 2.3% "useful" reasoning (the rest being wasted or even misleading tokens), according to Liquid AI: https://www.linkedin.com/posts/maxime-labonne_most-reasoning-steps-in-llms-are-just-activity-7437484983092588545-8L31. Step-3.5-Flash is also guilty of this sometimes, but not as much. MiniMax is much more coherent, even at Q2, and is much less wasteful of CoT tokens in general.

The one thing that Qwen offers is good, accurate vision support, but I'm working on running it at Q4 on my 8GB card with reasoning disabled, for vision alone, and routing image-to-text requests to it through a proxy so that everything else goes to MiniMax/Step.
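A rough sketch of that routing decision, in case it's useful to anyone. It assumes OpenAI-style chat payloads, where images arrive as content parts with `"type": "image_url"`. The backend URLs and the names `has_image` and `pick_backend` are made up for illustration; the actual proxy (forwarding, streaming, error handling) is left out.

```python
from typing import Any

# Hypothetical backend addresses; substitute whatever servers you actually run.
VISION_BACKEND = "http://localhost:8001/v1"  # assumed: small vision model on the 8GB card
TEXT_BACKEND = "http://localhost:8000/v1"    # assumed: MiniMax/Step on the main GPUs


def has_image(payload: dict[str, Any]) -> bool:
    """Return True if any message carries an OpenAI-style image content part."""
    for message in payload.get("messages", []):
        content = message.get("content")
        # Plain-string content cannot contain an image part, so only lists are checked.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_backend(payload: dict[str, Any]) -> str:
    """Send image-bearing requests to the vision model, everything else to the text model."""
    return VISION_BACKEND if has_image(payload) else TEXT_BACKEND
```

A proxy would call `pick_backend(request_json)` on each incoming `/v1/chat/completions` body and forward the request unchanged to the returned base URL.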

BTW: both MiniMax M2.7 and Step-3.6-Flash are due soon (a few weeks). I'd encourage you to take a look at those when they finally drop, too. Also, MiniMax M3 should be multimodal.
