pruned version
Hi. As the owner of a small giant RTX 5080, I need a pruned + high-quality DAQ NVFP4 version of this model to be able to run NemoClaw locally. It looks like all of NVIDIA's small model releases focus on the RTX 5090 only... Please include the 16 GB VRAM boards as well.
16GB is going to be impossible for this model without REAP or similar removal/merging of experts. The best NVFP4 quant available (which uses the official recipe) is close to 20GB: https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4. You could try that or a 4-bit GGUF with CPU/RAM offloading, though, and may be pleasantly surprised.
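For a rough sense of why 16GB is out of reach, here is a back-of-the-envelope sketch of weight size per format. It assumes a round 30 billion total parameters (read off the "30B" in the model name) and an effective 4.5 bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-weight block). The `reserve_gb` headroom for KV cache and activations is a hypothetical placeholder, not a measured number. Real checkpoints keep some tensors at higher precision, which is consistent with the linked quant landing closer to 20GB than this estimate.

```python
# Rough weight-size estimate; all constants below are assumptions for illustration.

TOTAL_PARAMS = 30e9  # assumption: "30B" taken as a round 30 billion parameters

# Assumed effective bits per weight for each format.
FORMATS = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.5,  # 4-bit values + one 8-bit scale shared by 16 weights
}


def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


def fits(total_params: float, bits_per_weight: float,
         vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if weights plus a hypothetical runtime reserve fit in VRAM."""
    return weights_gb(total_params, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    for name, bits in FORMATS.items():
        size = weights_gb(TOTAL_PARAMS, bits)
        print(f"{name}: ~{size:.1f} GB of weights, "
              f"fits in 16 GB: {fits(TOTAL_PARAMS, bits, 16.0)}, "
              f"fits in 24 GB: {fits(TOTAL_PARAMS, bits, 24.0)}")
```

Under these assumptions the 4-bit weights alone already exceed 16GB before any context is allocated, so a 16GB card needs either expert removal/merging or offloading part of the model to CPU/RAM.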
For running natively on your card, the best model in this series is likely the dense Mamba hybrid 4B Nemotron-3 at FP8 (https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8). For NemoClaw-style work, also consider FP8 or 4-bit quants of Qwen3.5 9B.
Check out this setup from Sudo su:
"i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt."