Nemotron-Cascade-2-30B-A3B-FP8
FP8 W8A8 quantization of Nemotron-Cascade-2-30B-A3B — a 32B hybrid Mamba-MoE reasoning model (3.2B active). Quantized with NVIDIA ModelOpt using the same selective quantization recipe as NVIDIA's official Nano FP8: MoE experts and Mamba GEMMs in FP8, attention and sensitive layers in BF16, KV cache in FP8.
Benchmarks
Computed with NVIDIA-NeMo/Evaluator, using the eval config from Nemotron-3-Super-120B:
| Benchmark | Nemotron-Cascade-2-30B-A3B (reproduced results) | Nemotron-Cascade-2-30B-A3B-FP8 (this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 96.7 |
| AIME 2026 (avg@8) | 94.2 | 95.0 |
| HMMT Feb 2025 (avg@8) | 92.9 | 93.8 |
With the low sample count (8 rollouts per problem), a deviation of about ±2% across runs is expected. Within that noise, the BF16 and FP8 variants are effectively equivalent in reasoning performance.
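
As a rough sanity check on that noise estimate, the sketch below computes a binomial standard error for each avg@8 score. It rests on two assumptions that this card does not state: each benchmark has 30 problems, and all rollouts are independent (in practice, rollouts on the same problem are correlated). Under those assumptions the standard errors land in the same order of magnitude as the ±2% quoted above. The helper name `avg_at_k_stderr` is illustrative only.

```python
import math

def avg_at_k_stderr(accuracy_pct: float, n_problems: int, k: int) -> float:
    """Binomial standard error, in percentage points, of an avg@k score.

    Treats all n_problems * k rollouts as independent Bernoulli trials,
    which is an approximation.
    """
    p = accuracy_pct / 100.0
    n = n_problems * k
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Assumption: 30 problems per benchmark (not stated in this card).
N_PROBLEMS, K = 30, 8

# (BF16, FP8) scores copied from the benchmark table.
scores = {
    "AIME 2025": (98.8, 96.7),
    "AIME 2026": (94.2, 95.0),
    "HMMT Feb 2025": (92.9, 93.8),
}

for name, (bf16, fp8) in scores.items():
    se_bf16 = avg_at_k_stderr(bf16, N_PROBLEMS, K)
    se_fp8 = avg_at_k_stderr(fp8, N_PROBLEMS, K)
    # Standard error of the difference between two independent runs.
    se_diff = math.hypot(se_bf16, se_fp8)
    print(f"{name}: diff={fp8 - bf16:+.1f} pts, se_diff={se_diff:.2f} pts")
```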
Quantization Details
- Method: FP8 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
- Format: FP8 E4M3 — static per-tensor weight scaling, dynamic per-token activation scaling at inference
- KV cache: FP8
- Tooling: NVIDIA ModelOpt
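
To make the two scaling modes concrete, here is a minimal pure-Python sketch of the arithmetic. It is not ModelOpt's implementation: the function names are hypothetical, and using the absolute maximum (amax) to derive each scale is an assumption. The only format fact it relies on is that the largest finite E4M3 value is 448.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (the "e4m3fn" variant)

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = abs(x)
    _, exp = math.frexp(a)      # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # exponents below -6 fall in the subnormal range
    step = 2.0 ** (e - 3)       # 3 mantissa bits give 8 steps per power of two
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)

def _scale_for(amax: float) -> float:
    # Guard against all-zero inputs so the division below stays defined.
    return amax / E4M3_MAX if amax > 0.0 else 1.0

def quantize_weight_per_tensor(w):
    """Static per-tensor scaling: one scale for the whole weight matrix."""
    scale = _scale_for(max(abs(v) for row in w for v in row))
    return [[round_to_e4m3(v / scale) for v in row] for row in w], scale

def quantize_activation_per_token(x):
    """Dynamic per-token scaling: one scale per row, computed at run time."""
    q, scales = [], []
    for row in x:
        scale = _scale_for(max(abs(v) for v in row))
        scales.append(scale)
        q.append([round_to_e4m3(v / scale) for v in row])
    return q, scales

# Usage: quantize a tiny weight matrix, then dequantize to inspect the error.
w = [[0.02, -0.5], [0.25, 0.1]]
wq, w_scale = quantize_weight_per_tensor(w)
w_hat = [[v * w_scale for v in row] for row in wq]
```

The weight scale can be computed once offline because the weights are fixed, while activation scales depend on each incoming token and so have to be recomputed at inference time.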
Selective Quantization Recipe
Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4). Same recipe as NVIDIA's official FP8 checkpoint. Sensitive components are kept in higher precision:
| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | FP8 | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | FP8 | 17 of 23 Mamba layers |
| Attention layers (all 6) | BF16 | Most sensitive — kept BF16 per NVIDIA sensitivity analysis |
| Mamba layers adjacent to attention (6) | BF16 | Layers {4, 11, 18, 25, 32, 41} — found sensitive in ablations |
| Mamba 1D conv | BF16 | All layers |
| Router gates | FP32 | Routing precision must not degrade |
| Embeddings & lm_head | BF16 | Not quantized |
| KV cache | FP8 | All 6 attention layers |
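
The table can also be read as a lookup from component to precision. The sketch below encodes it that way; the component names and the `precision_for` helper are hypothetical labels chosen for illustration and do not correspond to ModelOpt's configuration format or to module names in the checkpoint.

```python
from typing import Optional

# Mamba layers adjacent to attention, kept in BF16 (indices from the table).
BF16_MAMBA_LAYERS = frozenset({4, 11, 18, 25, 32, 41})

# Components whose precision does not depend on the layer index.
_FIXED_PRECISION = {
    "moe_expert_gemm": "FP8",   # routed and shared experts
    "attention": "BF16",
    "mamba_conv1d": "BF16",
    "router_gate": "FP32",
    "embedding": "BF16",
    "lm_head": "BF16",
    "kv_cache": "FP8",
}

def precision_for(component: str, layer_index: Optional[int] = None) -> str:
    """Return the precision the recipe assigns to a component."""
    if component == "mamba_gemm":
        if layer_index is None:
            raise ValueError("mamba_gemm needs a layer index")
        return "BF16" if layer_index in BF16_MAMBA_LAYERS else "FP8"
    return _FIXED_PRECISION[component]

# Consistency check against the table: 23 Mamba layers, 6 kept in BF16.
assert 23 - len(BF16_MAMBA_LAYERS) == 17
```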
Calibration
- Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
- Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
- Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
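
A calibration set with this domain mix could be assembled along the following lines. This is a sketch under stated assumptions, not the script used for this checkpoint: the record schema (a `domain` key and a `tokens` list) is hypothetical, random sampling within each domain is an assumption, and so is truncating over-long samples to the 12,288-token limit rather than filtering them out.

```python
import random
from collections import defaultdict

# Per-domain sample counts from the list above (4,000 in total).
DOMAIN_MIX = {
    "math": 1000,
    "swe": 900,
    "terminal_agent": 500,
    "science": 500,
    "chat": 400,
    "conversational_agent": 300,
    "instruction_following": 300,
    "safety": 100,
}
MAX_TOKENS = 12_288

def build_calibration_set(records, mix=DOMAIN_MIX, seed=0):
    """Sample `mix[domain]` records per domain from already-tokenized records.

    `records` is an iterable of dicts with a "domain" key and a "tokens" list
    (hypothetical schema, not the dataset's actual column names).
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for rec in records:
        by_domain[rec["domain"]].append(rec)

    selected = []
    for domain, count in mix.items():
        pool = by_domain[domain]
        if len(pool) < count:
            raise ValueError(f"not enough records for domain {domain!r}")
        for rec in rng.sample(pool, count):
            # Assumption: truncate long samples; shorter ones keep their
            # natural length, with no padding.
            selected.append({**rec, "tokens": rec["tokens"][:MAX_TOKENS]})
    return selected
```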
Usage
SGLang
python -m sglang.launch_server \
--model chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
--trust-remote-code \
--tool-call-parser qwen3_coder \
--reasoning-parser nano_v3
vLLM
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
--mamba_ssm_cache_dtype float32 \
--max-model-len 262144 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3
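
Both servers expose an OpenAI-compatible chat completions endpoint, so a running instance can be queried with a small standard-library client like the sketch below. The helper names are illustrative. The base URL assumes the default ports (8000 for vLLM, 30000 for SGLang), since the commands above do not set one.

```python
import json
import urllib.request

MODEL = "chankhavu/Nemotron-Cascade-2-30B-A3B-FP8"

def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the first choice's message dict."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]

# Example (requires a running server):
# print(chat("What is the capital of France?"))
# For SGLang on its default port:
# print(chat("What is the capital of France?", base_url="http://localhost:30000"))
```

`chat` returns the whole message rather than only its `content`, because with a reasoning parser enabled the server may place the reasoning trace in a separate field of the message.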
Acknowledgments
- Quantization recipe based on the Nemotron 3 Nano Technical Report
- Quantized with NVIDIA ModelOpt