Instructions to use nvidia/Nemotron-Cascade-2-30B-A3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Nemotron-Cascade-2-30B-A3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Nemotron-Cascade-2-30B-A3B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Cascade-2-30B-A3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Cascade-2-30B-A3B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/Nemotron-Cascade-2-30B-A3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Cascade-2-30B-A3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-2-30B-A3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
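The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8000`, as in the curl call above. The helper names `build_chat_request`, `extract_reply`, and `ask` are illustrative and not part of vLLM. It also assumes the response body follows the usual OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL_ID = "nvidia/Nemotron-Cascade-2-30B-A3B"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, base_url="http://localhost:8000", timeout=120):
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Usage, once the server is running:
#   print(ask("What is the capital of France?"))
```

Passing a different `base_url` (for example `http://localhost:30000`) should target any other server that exposes the same OpenAI-compatible endpoint.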

Use Docker

docker model run hf.co/nvidia/Nemotron-Cascade-2-30B-A3B

SGLang

How to use nvidia/Nemotron-Cascade-2-30B-A3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Cascade-2-30B-A3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-2-30B-A3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Cascade-2-30B-A3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-2-30B-A3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Cascade-2-30B-A3B with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Cascade-2-30B-A3B
```

pruned version

#16

by pirola - opened Mar 26

Discussion

pirola

Mar 26

Hi. As the owner of a small giant RTX 5080, I need a pruned, high-quality DAQ NVFP4 version of this model to run NemoClaw locally. It looks like all small model releases from NVIDIA focus on the RTX 5090 only... please support the 16 GB VRAM boards as well.

JDWarner

Mar 26

16GB is going to be impossible for this model without REAP or similar removal/merging of experts. The best NVFP4 quant available (which uses the official recipe) is close to 20GB: https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4. You could try that or a 4-bit GGUF with CPU/RAM offloading, though, and may be pleasantly surprised.

Running natively on your card, the best model in this series is likely the dense Mamba hybrid 4B Nemotron-3 at FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8. For NemoClaw-style work, also consider FP8 or 4-bit quants of Qwen3.5 9B.
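A rough back-of-the-envelope check of the 16 GB claim, as a sketch rather than a measurement. It assumes the nominal 30B parameter count from the model name. It also assumes NVFP4 costs about 4.5 bits per weight: 4-bit values plus one 8-bit scale per 16-value block. KV cache, activations, and any tensors kept at higher precision are not modeled, which is presumably why the real quant mentioned above lands closer to 20 GB.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 30e9  # assumption: nominal "30B" taken from the model name
VRAM_GB = 16           # the card discussed in this thread

# Bits per weight for each format; the NVFP4 figure is an assumption
# (4-bit values plus one 8-bit scale shared by each block of 16 values).
formats = {
    "BF16": 16,
    "FP8": 8,
    "NVFP4 (with block scales)": 4.5,
    "ideal 4-bit (no scales)": 4,
}

estimates = {name: weight_memory_gb(NOMINAL_PARAMS, bits) for name, bits in formats.items()}

for name, gb in estimates.items():
    headroom = VRAM_GB - gb
    print(f"{name}: {gb:.1f} GB of weights, {headroom:+.1f} GB left of {VRAM_GB} GB")
```

Even the idealized 4-bit case leaves only about 1 GB for everything else, which is consistent with the suggestion to prune experts or offload to CPU/RAM.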

ychenNLP

NVIDIA org Mar 27

Check out this setup from Sudo su:

"i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt."

https://x.com/sudoingX/status/2037512256599306578?s=20
