Text Generation
Transformers
Safetensors
PyTorch
nvidia
nemotron-3
latent-moe
mtp
conversational
8-bit precision
dmax123 commited on
Commit
92e0cb5
·
verified ·
1 Parent(s): de9a676

Upload 5 files

Browse files
Files changed (5) hide show
  1. README.md +80 -44
  2. bias.md +10 -0
  3. explainability.md +14 -0
  4. privacy.md +5 -0
  5. safety.md +9 -0
README.md CHANGED
@@ -55,9 +55,9 @@ track_downloads: true
55
  </a>
56
  </div>
57
 
58
- <div align="center" style="line-height: 1;">
59
  <a href="https://openmdw.ai/license/1-1/" style="margin: 2px;">
60
- <img alt="License" src="https://img.shields.io/badge/License-OpenMDW-1.1-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
61
  </a>
62
  </div>
63
 
@@ -74,8 +74,8 @@ track_downloads: true
74
  | **Supported Languages** | English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese |
75
  | **Best For** | Frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, high-stakes RAG |
76
  | **Reasoning Mode** | Configurable on/off via chat template (`enable_thinking=True/False`) |
77
- | **License** | [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/) |
78
- | **Release Date** | June 6th 2026 |
79
 
80
 
81
  ## Quick Start
@@ -105,7 +105,7 @@ NVIDIA Nemotron™ is a family of open models with open weights, training data,
105
 
106
  The model employs a hybrid **Latent Mixture-of-Experts (LatentMoE)** architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Like the Super model, the Ultra model incorporates **Multi-Token Prediction (MTP)** layers for faster text generation and improved quality, and it is trained using an **NVFP4** pre-training recipe to maximize compute efficiency. The model has **55B active parameters** and **550B parameters in total**.
107
 
108
- The supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese
109
 
110
  This model is ready for commercial and non-commercial use.
111
 
@@ -113,11 +113,52 @@ This model is ready for commercial and non-commercial use.
113
 
114
  **Governing Download Terms:** Use of this model is governed by the [OpenMDW-1.1 model license](https://openmdw.ai/license/1-1/).
115
 
116
- **Governing Download Terms with NIM:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).
117
-
118
  ### Benchmarks
119
 
120
- TBD
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
121
 
122
  All evaluation results were collected via [Nemo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator). We used three main evaluation harnesses: [Nemo Gym](https://github.com/NVIDIA-NeMo/Gym), [Nemo Skills](https://github.com/NVIDIA-NeMo/Skills), and [Harbor](https://github.com/harbor-framework/harbor) with extended sandboxing support via AWS ECS on Nemo Evaluator. In addition, the evaluations also used dedicated open-source packaged containers for ScaleAI Multi Challenge Multi Turn Instruction Following and KernelBench. For reproducibility purposes, more details on the evaluation settings and pinned containers can be found in the [Nemo Evaluator SDK examples folder](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/examples/nemotron/nemotron-3-ultra) and the [reproducibility tutorial for Nemotron 3 Ultra](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/examples/nemotron/nemotron-3-ultra/reproducibility.md).
123
 
@@ -131,8 +172,7 @@ NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 is a frontier-scale general purpose reas
131
 
132
  ### Release Date
133
 
134
- NGC - 06/04/2026 via [NGC]()
135
- Hugging Face - 06/04/2026 via [Hugging Face]()
136
 
137
  ## Reference(s)
138
 
@@ -160,30 +200,26 @@ Stage 2: Supervised Fine-Tuning
160
 
161
  * The model was further fine-tuned on synthetic code, math, science, tool calling, instruction following, structured outputs, and general knowledge data. This stage incorporated data designed to support long-range retrieval and multi-document aggregation. All datasets are disclosed in the [Training and Evaluation Datasets](#training-and-evaluation-datasets) section of this document. Major portions of the fine-tuning corpus are released in the [Nemotron-Post-Training-v3](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) collection. [Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is one of the libraries used to prepare these corpora.
162
 
163
- Stage 3: Multi-Domain On-Policy Distillation (MOPD)
164
-
165
- * The model underwent **Multi-Domain On-Policy Distillation (MOPD)** to improve reasoning across many task types while staying efficient. This technique uses strong teacher models to guide training on the model's own generated attempts (on-policy rollouts), helping recover accuracy and improve performance across coding, math, instruction following, tool use, and agentic workflows. By distilling teacher signal onto the student's own trajectories rather than offline traces, MOPD better aligns the student's behavior with what it would actually produce at inference time, yielding stronger gains than purely off-policy distillation.
166
-
167
- Stage 4: Reinforcement Learning
168
 
169
  * The model underwent multi-environment reinforcement learning using asynchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. It utilized an asynchronous RL architecture that fully decouples training from inference across separate GPU devices, leveraging in-flight weight updates and MTP to accelerate rollout generation. Conversational quality was further refined through RLHF. All datasets are disclosed in the [Training and Evaluation Datasets](#training-and-evaluation-datasets) section of this document. The RL environments and datasets are released as part of [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym).
170
  * Software used for reinforcement learning: [NeMo RL](https://github.com/NVIDIA-NeMo/RL), [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym)
171
 
172
- NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 model is a result of the above work.
173
 
174
- The end-to-end training recipe is available in the [NVIDIA Nemotron Developer Repository](https://github.com/NVIDIA-NeMo/Nemotron). Evaluation results can be replicated using the [NeMo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator). [Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is one of the libraries used to prepare the pre and post training datasets. More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron 3 Ultra Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf).
175
 
176
- ## Computational Load (Internal Only: For NVIDIA Models Only; please add as an HTML comment and remove the fields below from the published model card)
177
 
178
- **Cumulative Compute:** Pre-Training (7.18e+24 FLOPS) - Post-Training (1.15e+23 FLOPs)
179
- **Estimated Energy and Emissions for Model Training:** Pre-Training (11,890,852 kWh, 3841) - Post-Training (299,600 kWh, 102)
 
180
 
181
  ## Input
182
 
183
  - **Input Type(s):** Text
184
  - **Input Format(s):** String
185
  - **Input Parameters:** One-Dimensional (1D): Sequences
186
- - **Other Properties Related to Input:** Maximum context length up to 1M tokens. Supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese
187
 
188
  ## Output
189
 
@@ -196,7 +232,7 @@ Our AI models are designed and optimized to run on NVIDIA GPU-accelerated system
196
 
197
  ## Software Integration
198
 
199
- - Runtime Engine(s): NeMo 25.11.01
200
  - Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere - A100; NVIDIA Blackwell; NVIDIA Hopper - H100-80GB
201
  - Operating System(s): Linux
202
 
@@ -239,7 +275,7 @@ ray status --address=${RAY_HEAD_IP}:${RAY_PORT}
239
 
240
  ### **vLLM**
241
 
242
- **Recommended container:** `vllm/vllm-openai:v0.21.0` (or `v0.20.1`)
243
 
244
  For more detailed information, please see this cookbook.
245
 
@@ -378,7 +414,7 @@ docker run -d --name nemotron-ultra-sglang \
378
 
379
  ### **TRT-LLM**
380
 
381
- **Container:** `docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc16`
382
 
383
  For more detailed information, please see this cookbook.
384
 
@@ -716,15 +752,15 @@ print(result)
716
 
717
  # Training
718
 
719
- **Data Modality:** Text
720
- **The total size:** 53.8 TiB (14.8 trillion tokens)
721
- **Total number of datasets:** 226
722
- **Dataset partition:** *Training [100%], testing [0%], validation [0%]*
723
- **Time period for training data collection:** 2013 to 2026
724
- **Time period for testing data collection:** 2013 to 2026
725
- **Time period for validation data collection:** 2013 to 2026
726
- **Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic
727
- **Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
728
 
729
  NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 is pre-trained on a large corpus of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 11 other languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracy. The model was pre-trained for approximately 20 trillion tokens.
730
 
@@ -820,7 +856,7 @@ The foundation of the model is trained on the **Nemotron-3-Ultra** corpus, compr
820
 
821
  The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied just heuristic filtering instead—similar to what we did for lower quality English data in the Nemotron-CC pipeline, but selectively removing some filters for some languages that did not work well. Deduplication was done in the same way as for Nemotron-CC.
822
 
823
- The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the [technical report](https://arxiv.org/abs/2512.20848)).
824
 
825
  | Dataset | Modality | Dataset Size | Collection Period | Collecting Organisation |
826
  | :---- | :---- | :---- | :---- | :---- |
@@ -1037,23 +1073,23 @@ The following table depicts our sample distribution.
1037
 
1038
  ## Evaluation Datasets:
1039
 
1040
- ** Data Collection Method by dataset <br>
1041
  * Hybrid: Automated, Human, Synthetic
1042
 
1043
- ** Labeling Method by dataset <br>
1044
  * Hybrid: Automated, Human, Synthetic
1045
 
1046
- ** Properties:** This corpus comprises a mix of high-quality standard benchmarks and test suites for modern agentic AI as outlined in the benchmark section of the model card.
1047
 
1048
  ## Testing Datasets:
1049
 
1050
- ** Data Collection Method by dataset <br>
1051
  * Hybrid: Automated, Human, Synthetic
1052
 
1053
- ** Labeling Method by dataset <br>
1054
  * Hybrid: Automated, Human, Synthetic
1055
 
1056
- ** Properties:** This corpus comprises a mix of high-quality standard benchmarks and test suites for modern agentic AI as outlined in the benchmark section of the model card.
1057
 
1058
  </details>
1059
 
@@ -1063,7 +1099,7 @@ The following table depicts our sample distribution.
1063
  * **Test Hardware:**
1064
  * NVIDIA Hopper
1065
  * H100
1066
- * 1-8x H200
1067
  * NVIDIA Grace Blackwell
1068
  * GB200
1069
  * GB300
@@ -1084,11 +1120,11 @@ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concern
1084
  ## Citation
1085
 
1086
  ```bibtex
1087
- @misc{nvidia_nemotron_3_2025,
1088
- title = {NVIDIA Nemotron 3: Efficient and Open Intelligence},
1089
  author = {{NVIDIA}},
1090
  year = {2025},
1091
- url = {https://arxiv.org/abs/2512.20856},
1092
  note = {White Paper}
1093
  }
1094
  ```
 
55
  </a>
56
  </div>
57
 
58
+ <div style="text-align: center; line-height: 1;">
59
  <a href="https://openmdw.ai/license/1-1/" style="margin: 2px;">
60
+ <img alt="License" src="https://img.shields.io/badge/License-OpenMDW--1.1-f5de53" style="display: inline-block; vertical-align: middle;"/>
61
  </a>
62
  </div>
63
 
 
74
  | **Supported Languages** | English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese |
75
  | **Best For** | Frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, high-stakes RAG |
76
  | **Reasoning Mode** | Configurable on/off via chat template (`enable_thinking=True/False`) |
77
+ | **License** | [OpenMDW License Agreement, version 1.1](https://raw.githubusercontent.com/OpenMDW/OpenMDW/refs/heads/main/1.1/LICENSE.OpenMDW-1.1) |
78
+ | **Release Date** | June 4, 2026 |
79
 
80
 
81
  ## Quick Start
 
105
 
106
  The model employs a hybrid **Latent Mixture-of-Experts (LatentMoE)** architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Like the Super model, the Ultra model incorporates **Multi-Token Prediction (MTP)** layers for faster text generation and improved quality, and it is trained using an **NVFP4** pre-training recipe to maximize compute efficiency. The model has **55B active parameters** and **550B parameters in total**.
107
 
108
+ The supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese.
109
 
110
  This model is ready for commercial and non-commercial use.
111
 
 
113
 
114
  **Governing Download Terms:** Use of this model is governed by the [OpenMDW-1.1 model license](https://openmdw.ai/license/1-1/).
115
 
 
 
116
  ### Benchmarks
117
 
118
+ | Benchmark | N-3-Ultra <br> 550B-A55B | MiniMax-2.7 <br> 230B-A10B | GLM-5.1 <br> 744B-A40B | Kimi-K2.6 <br> 1T-A32B | Qwen-3.5 <br> 397B-17B | DS-v4-Pro <br> 1.6T-A49B | DS-v4-Flash <br> 284B-A13B |
119
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
120
+ | **Agentic** | | | | | | | |
121
+ | Terminal Bench 2.1 | 56.4 | 55.5 | 59.3 | 67.2 | 49.9 | 49.2 | 54.2 |
122
+ | GDPVal | 46.7 | 47.6 | 54.7 | 50.4 | 34.6 | 54.6 | 50.2 |
123
+ | SWE-Bench Verified | 71.9 | 72.2 | 73.8 | 69.5 | 69.9 | 74.0 | 72.4 |
124
+ | SWE-Bench Multilingual | 67.7 | 69.2 | 73.8 | 65.9 | 67.7 | 71.9 | 72.1 |
125
+ | ProfBench (Search) | 56.0 | 52.0 | 46.0 | 56.0 | 53.0 | 59.9 | 57.0 |
126
+ | PinchBench | 90.0 | 77.6 | 81.2 | 90.2 | 86.6 | 88.6 | 91.3 |
127
+ | TauBench V3 | | | | | | | |
128
+ | &nbsp;&nbsp;Airline | 81.5 | 75.3 | 85.0 | 85.8 | 76.5 | 80.8 | 80.8 |
129
+ | &nbsp;&nbsp;Retail | 86.4 | 84.9 | 84.1 | 82.9 | 88.5 | 88.9 | 89.1 |
130
+ | &nbsp;&nbsp;Telecom | 92.9 | 89.6 | 96.9 | 97.8 | 98.0 | 96.3 | 98.3 |
131
+ | &nbsp;&nbsp;Banking | 22.6 | 14.6 | 12.8 | 23.1 | 20.9 | 25.9 | 26.7 |
132
+ | &nbsp;&nbsp;Average | 70.9 | 66.1 | 69.7 | 72.4 | 71.0 | 73.2 | 73.7 |
133
+ | BrowseComp | 44.4 | 54.1 | 59.4 | 61.3 | 40.5 | 59.4 | 46.9 |
134
+ | Vals.ai Financial Agent 1.1 | | | | | | | |
135
+ | &nbsp;&nbsp;without web search | 60.1 | 51.3 | 60.2 | 54.0 | 61.3 | 58.9 | 58.4 |
136
+ | &nbsp;&nbsp;with web search | 53.7 | 50.5 | 60.7 | 58.8 | 59.0 | 62.3 | 60.1 |
137
+ | **Reasoning and Knowledge** | | | | | | | |
138
+ | IOI 2025 | 570.0 | -- | 456.5 | 585.0 | 441.3 | 580.1 | -- |
139
+ | LiveCodeBench (v6) | 89.0 | 77.2 | 85.7 | 90.2 | 79.3 | 92.5 | 90.9 |
140
+ | IMOAnswerBench (no tools) | 88.6 | 68.3 | 86.8 | 91.1 | 83.1 | 93.0 | 91.1 |
141
+ | IMOAnswerBench (with tools) | 92.3 | 75.1 | 91.1 | 93.71 | 84.51 | 85.4 | 89.6 |
142
+ | Apex-Shortlist (no tools) | 74.9 | 28.9 | 71.1 | 77.4 | 61.4 | 85.8 | 82.4 |
143
+ | Apex-Shortlist (with tools) | 84.8 | 51.9 | 79.0 | 73.2 | 60.4 | 86.5 | 82.0 |
144
+ | GPQA (no tools) | 87.0 | 86.6 | 86.1 | 91.0 | 87.1 | 87.8 | 88.5 |
145
+ | SciCode (subtask) | 44.6 | 38.3 | 47.7 | 52.0 | 48.0 | 50.5 | 48.2 |
146
+ | HLE (no tools) | 26.7 | 23.1 | 27.2 | 34.8 | 28.5 | 37.7 | 32.2 |
147
+ | HLE (with tools) | 37.4 | -- | 50.4 | 54.0 | 48.3 | 48.2 | 45.1 |
148
+ | CritPt (no tools) | 3.1 | 0.6 | 3.7 | 9.1 | 2.4 | 14.0 | 10.6 |
149
+ | MMLU-Pro | 86.8 | 81.9 | 85.9 | 88.1 | 88.3 | 87.5 | 86.4 |
150
+ | OmniScience Accuracy | 24.1 | 20.5 | 31.3 | 35.5 | 35.9 | 46.8 | 39.9 |
151
+ | OmniScience Non-Hallucination | 78.7 | 74.4 | 66.8 | 67.1 | 7.4 | 5.7 | 2.8 |
152
+ | **Chat & Instruction Following** | | | | | | | |
153
+ | IFBench (prompt loose) | 81.7 | 74.6 | 76.6 | 73.7 | 78.2 | 79.1 | 82.0 |
154
+ | Multi-Challenge | 63.8 | 42.5 | 63.0 | 63.1 | 63.9 | 64.1 | 63.5 |
155
+ | **Long Context** | | | | | | | |
156
+ | AA-LCR | 65.4 | 69.8 | 66.9 | 70.2 | 68.3 | 67.3 | 62.7 |
157
+ | RULER (1M) | 94.7 | -- | -- | -- | 90.1 | 94.2 | 87.7 |
158
+ | Longbench v2 (≤ 1M) | 61.9 | -- | -- | -- | 68.9 | 62.1 | 57.0 |
159
+ | **Multilingual** | | | | | | | |
160
+ | MMLU-ProX (avg en/de/fr/es/it/ja/zh/hi/pt/ko) | 83.0 | 78.4 | 85.8 | 85.0 | 86.4 | 85.6 | 84.3 |
161
+ | WMT24++ (en→xx) | 83.7 | 82.8 | 84.4 | 84.5 | 86.8 | 85.9 | 85.9 |
162
 
163
  All evaluation results were collected via [Nemo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator). We used three main evaluation harnesses: [Nemo Gym](https://github.com/NVIDIA-NeMo/Gym), [Nemo Skills](https://github.com/NVIDIA-NeMo/Skills), and [Harbor](https://github.com/harbor-framework/harbor) with extended sandboxing support via AWS ECS on Nemo Evaluator. In addition, the evaluations also used dedicated open-source packaged containers for ScaleAI Multi Challenge Multi Turn Instruction Following and KernelBench. For reproducibility purposes, more details on the evaluation settings and pinned containers can be found in the [Nemo Evaluator SDK examples folder](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/examples/nemotron/nemotron-3-ultra) and the [reproducibility tutorial for Nemotron 3 Ultra](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/examples/nemotron/nemotron-3-ultra/reproducibility.md).
164
 
 
172
 
173
  ### Release Date
174
 
175
+ Hugging Face - 06/04/2026 via [Hugging Face](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4)
 
176
 
177
  ## Reference(s)
178
 
 
200
 
201
  * The model was further fine-tuned on synthetic code, math, science, tool calling, instruction following, structured outputs, and general knowledge data. This stage incorporated data designed to support long-range retrieval and multi-document aggregation. All datasets are disclosed in the [Training and Evaluation Datasets](#training-and-evaluation-datasets) section of this document. Major portions of the fine-tuning corpus are released in the [Nemotron-Post-Training-v3](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) collection. [Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is one of the libraries used to prepare these corpora.
202
 
203
+ Stage 3: Reinforcement Learning
 
 
 
 
204
 
205
  * The model underwent multi-environment reinforcement learning using asynchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. It utilized an asynchronous RL architecture that fully decouples training from inference across separate GPU devices, leveraging in-flight weight updates and MTP to accelerate rollout generation. Conversational quality was further refined through RLHF. All datasets are disclosed in the [Training and Evaluation Datasets](#training-and-evaluation-datasets) section of this document. The RL environments and datasets are released as part of [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym).
206
  * Software used for reinforcement learning: [NeMo RL](https://github.com/NVIDIA-NeMo/RL), [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym)
207
 
208
+ Stage 4: Multi-Domain On-Policy Distillation (MOPD)
209
 
210
+ * The model underwent **Multi-Domain On-Policy Distillation (MOPD)** to improve reasoning across many task types while staying efficient. This technique uses strong teacher models to guide training on the model's own generated attempts (on-policy rollouts), helping recover accuracy and improve performance across coding, math, instruction following, tool use, and agentic workflows. By distilling teacher signal onto the student's own trajectories rather than offline traces, MOPD better aligns the student's behavior with what it would actually produce at inference time, yielding stronger gains than purely off-policy distillation.
211
 
 
212
 
213
+ NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 model is a result of the above work.
214
+
215
+ The end-to-end training recipe is available in the [NVIDIA Nemotron Developer Repository](https://github.com/NVIDIA-NeMo/Nemotron). Evaluation results can be replicated using the [NeMo Evaluator SDK](https://github.com/NVIDIA-NeMo/Evaluator). [Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is one of the libraries used to prepare the pre and post training datasets. More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron 3 Ultra Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf).
216
 
217
  ## Input
218
 
219
  - **Input Type(s):** Text
220
  - **Input Format(s):** String
221
  - **Input Parameters:** One-Dimensional (1D): Sequences
222
+ - **Other Properties Related to Input:** Maximum context length up to 1M tokens. Supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese.
223
 
224
  ## Output
225
 
 
232
 
233
  ## Software Integration
234
 
235
+ - Runtime Engine(s): NeMo 26.04.01
236
  - Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere - A100; NVIDIA Blackwell; NVIDIA Hopper - H100-80GB
237
  - Operating System(s): Linux
238
 
 
275
 
276
  ### **vLLM**
277
 
278
+ **Recommended container:** `vllm/vllm-openai:v0.22.0`
279
 
280
  For more detailed information, please see this cookbook.
281
 
 
414
 
415
  ### **TRT-LLM**
416
 
417
+ **Container:** `docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc17`
418
 
419
  For more detailed information, please see this cookbook.
420
 
 
752
 
753
  # Training
754
 
755
+ **Data Modality:** Text
756
+ **The total size:** 53.8 TiB (14.8 trillion tokens)
757
+ **Total number of datasets:** 226
758
+ **Dataset partition:** *Training [100%], testing [0%], validation [0%]*
759
+ **Time period for training data collection:** 2013 to 2026
760
+ **Time period for testing data collection:** 2013 to 2026
761
+ **Time period for validation data collection:** 2013 to 2026
762
+ **Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic
763
+ **Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
764
 
765
  NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 is pre-trained on a large corpus of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 11 other languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracy. The model was pre-trained for approximately 20 trillion tokens.
766
 
 
856
 
857
  The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied just heuristic filtering instead—similar to what we did for lower quality English data in the Nemotron-CC pipeline, but selectively removing some filters for some languages that did not work well. Deduplication was done in the same way as for Nemotron-CC.
858
 
859
+ The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf)).
860
 
861
  | Dataset | Modality | Dataset Size | Collection Period | Collecting Organisation |
862
  | :---- | :---- | :---- | :---- | :---- |
 
1073
 
1074
  ## Evaluation Datasets:
1075
 
1076
+ **Data Collection Method by dataset** <br>
1077
  * Hybrid: Automated, Human, Synthetic
1078
 
1079
+ **Labeling Method by dataset** <br>
1080
  * Hybrid: Automated, Human, Synthetic
1081
 
1082
+ **Properties:** This corpus comprises a mix of high-quality standard benchmarks and test suites for modern agentic AI as outlined in the benchmark section of the model card.
1083
 
1084
  ## Testing Datasets:
1085
 
1086
+ **Data Collection Method by dataset** <br>
1087
  * Hybrid: Automated, Human, Synthetic
1088
 
1089
+ **Labeling Method by dataset** <br>
1090
  * Hybrid: Automated, Human, Synthetic
1091
 
1092
+ **Properties:** This corpus comprises a mix of high-quality standard benchmarks and test suites for modern agentic AI as outlined in the benchmark section of the model card.
1093
 
1094
  </details>
1095
 
 
1099
  * **Test Hardware:**
1100
  * NVIDIA Hopper
1101
  * H100
1102
+ * H200
1103
  * NVIDIA Grace Blackwell
1104
  * GB200
1105
  * GB300
 
1120
  ## Citation
1121
 
1122
  ```bibtex
1123
+ @misc{nvidia_nemotron_3_ultra_2026,
1124
+ title = {Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
1125
  author = {{NVIDIA}},
1126
  year = {2025},
1127
+ url = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf},
1128
  note = {White Paper}
1129
  }
1130
  ```
bias.md ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ | Field | Response |
2
+ | :---- | :---- |
3
+ | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
4
+ | Bias Metric (If Measured): | [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/) |
5
+ | Which characteristic (feature) show(s) the greatest difference in performance?: | The model shows high variance in the characteristics when it is used with a high temperature. |
6
+ | Measures taken to mitigate against unwanted bias: | Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) employed to calibrate the model’s reasoning capabilities to maintain logical consistency and appropriate complexity when interacting with or interpreting data from diverse age demographics. |
7
+ | If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | The training datasets contain a large amount of synthetic data generated by LLMs. We manually curated prompts. |
8
+ | Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | [BBQ](https://github.com/nyu-mll/BBQ/) |
9
+ | Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | These datasets, such as web-scraped finance reasoning data derived from SEC EDGAR filings, science and math problem datasets, OpenResearcher/source-document datasets, Common Crawl, CC-News, Wikimedia, and long-context document datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). For instance, these datasets do not contain explicit mentions of demographic classes such as age, gender, or ethnicity in approximately 97% to 99.9% of finance reasoning samples and in over 85% of samples across the broader assessed datasets. In the subset where such terms are present, these datasets contain notable representational skews. For example, ethnicity mentions are often dominated by Middle Eastern contexts (found in finance documents) or "White," "Two or more," and "Black or African American" as the most frequent ethnic identifiers, while references categorized as male-only significantly outnumber those categorized as female-only. Furthermore, gender is explicitly mentioned in approximately 12% of samples across the broader dataset assessment, yet in only 0.9% of finance-specific samples. Dataset-level results vary by source type, with long-context/source-document datasets containing higher explicit demographic mention rates compared to certain web-scraped sources. To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, fine-tuning with demographically balanced datasets, and mitigation strategies such as counterfactual data augmentation to align with the desired model behavior. This evaluation used a 3,000-sample subset per dataset, identified as the optimal threshold for maximizing embedder accuracy. |
10
+ | Unwanted Bias Testing: | Constrained to English-language inputs. Multi-lingual parity is not currently claimed or guaranteed. |
explainability.md ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ | Field | Response |
2
+ | :---- | :---- |
3
+ | Intended Task/Domain: | Text generation, reasoning, and chat |
4
+ | Model Type: | Text-to-text Mamba2-Transformer Hybrid |
5
+ | Intended Users: | Generative AI creators working with conversational AI models and image content. |
6
+ | Output: | Text |
7
+ | Tools used to evaluate datasets to identify synthetic data and ensure data authenticity. | We used a Gemma-3 4B-based filtering model fine-tuned on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) to ensure the quality of synthetic data. |
8
+ | Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers. |
9
+ | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Age, Disability Status, Gender Identity, Nationality, Physical Appearance, Ethnicity, Socioeconomic Status, Sexual Orientation, Religion |
10
+ | Technical Limitations & Mitigation: | This model performs particularly well in instruction following regimes, as such may be strongly influenced by untrusted inputs and should be paired with appropriate guardrails and data filtering to better align use-case behaviors when exposed to such data. |
11
+ | Verified to have met prescribed NVIDIA quality standards: | Yes |
12
+ | Performance Metrics: | Accuracy, Throughput, and User-side throughput |
13
+ | Potential Known Risks: | The model was optimized explicitly for instruction following and as such is more susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources -- either directly or indirectly by retrieval (e.g. via visiting a website) -- as they may yield outputs that can lead to harmful, system-level outcomes up to and including remote code execution in agentic systems when effective security controls including guardrails are not in place. The model may generate answers that may be inaccurate, omit key information, include irrelevant or redundant text, or produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. |
14
+ | Licensing: | Use of this model is governed by the [OpenMDW License Agreement, version 1.1](https://raw.githubusercontent.com/OpenMDW/OpenMDW/refs/heads/main/1.1/LICENSE.OpenMDW-1.1) (OpenMDW-1.1). |
privacy.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ | Privacy Information |
2
+ | :--- |
3
+ | Nemotron 3 Ultra was trained on large-scale publicly available data that may contain images, audio-video, and text relating to people. NVIDIA collected and used this data in compliance with applicable data protection and privacy laws. This model was not designed to derive insights or otherwise learn from any personal data contained in the datasets. |
4
+ | NVIDIA uses a combination of filters, data minimization techniques, and other guardrails to help prevent personal data from being recited by our models. We employ automated tools and data processing techniques during pre-training or training to identify and filter certain categories of personal data. |
5
+ | Please review NVIDIA's [Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) for more information. |
safety.md ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ | Field | Response |
2
+ | :---- | :---- |
3
+ | Model Application Field(s): | Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning, Customer Service |
4
+ | Describe the life critical impact (if present). | Not Applicable |
5
+ | Description of methods implemented in data acquisition or processing, if any, to address other types of potentially harmful data in the training, testing, and validation data: | We used a guard model for content safety to exclude potentially harmful data from training. |
6
+ | Description of any methods implemented in data acquisition or processing, if any, to address illegal or harmful content in the training data, including, but not limited to, child sexual abuse material (CSAM) and non-consensual intimate imagery (NCII) | We used a Gemma-3 4B-based guard model trained on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) for content safety to exclude potentially illegal or harmful content from the training. |
7
+ | Use Case Restrictions: | Use of this model is governed by the [OpenMDW License Agreement, version 1.1](https://raw.githubusercontent.com/OpenMDW/OpenMDW/refs/heads/main/1.1/LICENSE.OpenMDW-1.1) (OpenMDW-1.1).|
8
+ | Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |
9
+ | This AI model was developed based on our policies to ensure responsible data handling and risk mitigation. The datasets used for training have been scanned for harmful content and illegal content, consistent with our policies including scanning for Child Sexual Abuse Material (CSAM). Ongoing review and monitoring mechanisms are in place based on our policies and to maintain data integrity. | True. We use [Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) and an internal safety dataset specialized for minority sexuality for content safety evaluation to ensure the safety of this model. |