nithinraok committed on commit 7938c10 (1 parent: 9fc6642)
update card
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
README.md CHANGED
|
@@ -805,6 +805,7 @@ img {
|
|
| 805 |
**Supported Languages:**
|
| 806 |
Bulgarian (**bg**), Croatian (**hr**), Czech (**cs**), Danish (**da**), Dutch (**nl**), English (**en**), Estonian (**et**), Finnish (**fi**), French (**fr**), German (**de**), Greek (**el**), Hungarian (**hu**), Italian (**it**), Latvian (**lv**), Lithuanian (**lt**), Maltese (**mt**), Polish (**pl**), Portuguese (**pt**), Romanian (**ro**), Slovak (**sk**), Slovenian (**sl**), Spanish (**es**), Swedish (**sv**), Russian (**ru**), Ukrainian (**uk**)
|
| 807 |
|
| 808 |
|
| 809 |
## <span style="color:#466f00;">Key Features:</span>
|
| 810 |
|
|
@@ -815,9 +816,9 @@ Bulgarian (**bg**), Croatian (**hr**), Czech (**cs**), Danish (**da**), Dutch (*
|
|
| 815 |
* **Long audio** transcription, supporting audio **up to 24 minutes** long with full attention (on A100 80GB) or up to 3 hours with local attention.
|
| 816 |
* Released under a **permissive CC BY 4.0 license**
|
| 817 |
|
| 818 |
-
|
| 819 |
|
| 820 |
-
--
|
| 821 |
|
| 822 |
## Automatic Speech Recognition (ASR) Performance
|
| 823 |
|
|
@@ -833,11 +834,6 @@ This model is ready for commercial/non-commercial use.
|
|
| 833 |
|
| 834 |
**Note 2:** Performance differences may be partly attributed to Portuguese variant differences - our training data uses European Portuguese while most benchmarks use Brazilian Portuguese.
|
| 835 |
|
| 836 |
-
## <span style="color:#466f00;">License/Terms of Use:</span>
|
| 837 |
-
|
| 838 |
-
GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.
|
| 839 |
-
|
| 840 |
-
|
| 841 |
### <span style="color:#466f00;">Deployment Geography:</span>
|
| 842 |
Global
|
| 843 |
|
|
@@ -849,7 +845,8 @@ This model serves developers, researchers, academics, and industries building ap
|
|
| 849 |
|
| 850 |
### <span style="color:#466f00;">Release Date:</span>
|
| 851 |
|
| 852 |
-
08/14/2025
|
| 853 |
|
| 854 |
### <span style="color:#466f00;">Model Architecture:</span>
|
| 855 |
|
|
@@ -936,7 +933,7 @@ print(output[0].text)
|
|
| 936 |
## <span style="color:#466f00;">Software Integration:</span>
|
| 937 |
|
| 938 |
**Runtime Engine(s):**
|
| 939 |
-
* NeMo 2.
|
| 940 |
|
| 941 |
|
| 942 |
**Supported Hardware Microarchitecture Compatibility:**
|
|
@@ -1136,4 +1133,47 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe
|
|
| 1136 |
|
| 1137 |
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [here](https://developer.nvidia.com/blog/enhancing-ai-transparency-and-ethical-considerations-with-model-card/).
|
| 1138 |
|
| 1139 |
-
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
| 805 |
**Supported Languages:**
|
| 806 |
Bulgarian (**bg**), Croatian (**hr**), Czech (**cs**), Danish (**da**), Dutch (**nl**), English (**en**), Estonian (**et**), Finnish (**fi**), French (**fr**), German (**de**), Greek (**el**), Hungarian (**hu**), Italian (**it**), Latvian (**lv**), Lithuanian (**lt**), Maltese (**mt**), Polish (**pl**), Portuguese (**pt**), Romanian (**ro**), Slovak (**sk**), Slovenian (**sl**), Spanish (**es**), Swedish (**sv**), Russian (**ru**), Ukrainian (**uk**)
|
| 807 |
|
| 808 |
+
This model is ready for commercial/non-commercial use.
|
| 809 |
|
| 810 |
## <span style="color:#466f00;">Key Features:</span>
|
| 811 |
|
|
|
|
| 816 |
* **Long audio** transcription, supporting audio **up to 24 minutes** long with full attention (on A100 80GB) or up to 3 hours with local attention.
|
| 817 |
* Released under a **permissive CC BY 4.0 license**
|
| 818 |
|
| 819 |
+
## <span style="color:#466f00;">License/Terms of Use:</span>
|
| 820 |
|
| 821 |
+
GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.
|
| 822 |
|
| 823 |
## Automatic Speech Recognition (ASR) Performance
|
| 824 |
|
|
|
|
| 834 |
|
| 835 |
**Note 2:** Performance differences may be partly attributed to Portuguese variant differences - our training data uses European Portuguese while most benchmarks use Brazilian Portuguese.
|
| 836 |
|
| 837 |
### <span style="color:#466f00;">Deployment Geography:</span>
|
| 838 |
Global
|
| 839 |
|
|
|
|
| 845 |
|
| 846 |
### <span style="color:#466f00;">Release Date:</span>
|
| 847 |
|
| 848 |
+
Hugging Face [08/14/2025](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
|
| 849 |
+
|
| 850 |
|
| 851 |
### <span style="color:#466f00;">Model Architecture:</span>
|
| 852 |
|
|
|
|
| 933 |
## <span style="color:#466f00;">Software Integration:</span>
|
| 934 |
|
| 935 |
**Runtime Engine(s):**
|
| 936 |
+
* NeMo 2.4
|
| 937 |
|
| 938 |
|
| 939 |
**Supported Hardware Microarchitecture Compatibility:**
|
|
|
|
| 1133 |
|
| 1134 |
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [here](https://developer.nvidia.com/blog/enhancing-ai-transparency-and-ethical-considerations-with-model-card/).
|
| 1135 |
|
| 1136 |
+
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
|
| 1137 |
+
|
| 1138 |
+
## <span style="color:#466f00;">Bias:</span>
|
| 1139 |
+
|
| 1140 |
+
Field | Response
|
| 1141 |
+
---------------------------------------------------------------------------------------------------|---------------
|
| 1142 |
+
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None
|
| 1143 |
+
Measures taken to mitigate against unwanted bias | None
|
| 1144 |
+
|
| 1145 |
+
## <span style="color:#466f00;">Explainability:</span>
|
| 1146 |
+
|
| 1147 |
+
Field | Response
|
| 1148 |
+
------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------
|
| 1149 |
+
Intended Domain | Speech to Text Transcription
|
| 1150 |
+
Model Type | FastConformer
|
| 1151 |
+
Intended Users | This model is intended for developers, researchers, academics, and industries building conversation-based applications.
|
| 1152 |
+
Output | Text
|
| 1153 |
+
Describe how the model works | Speech input is encoded into embeddings and passed into a Conformer-based model, which outputs a text response.
|
| 1154 |
+
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | Not Applicable
|
| 1155 |
+
Technical Limitations & Mitigation | Transcripts may not be 100% accurate. Accuracy varies based on language and characteristics of input audio (Domain, Use Case, Accent, Noise, Speech Type, Context of speech, etc.)
|
| 1156 |
+
Verified to have met prescribed NVIDIA quality standards | Yes
|
| 1157 |
+
Performance Metrics | Word Error Rate
|
| 1158 |
+
Potential Known Risks | If a word was not seen in training and is not present in the vocabulary, it is unlikely to be recognized. Not recommended for word-for-word/incomplete sentences, as accuracy varies based on the context of the input text.
|
| 1159 |
+
Licensing | GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.
|
| 1160 |
+
|
| 1161 |
+
## <span style="color:#466f00;">Privacy:</span>
|
| 1162 |
+
|
| 1163 |
+
Field | Response
|
| 1164 |
+
----------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------
|
| 1165 |
+
Generatable or reverse engineerable personal data? | None
|
| 1166 |
+
Personal data used to create this model? | None
|
| 1167 |
+
Is there provenance for all datasets used in training? | Yes
|
| 1168 |
+
Does data labeling (annotation, metadata) comply with privacy laws? | Yes
|
| 1169 |
+
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
|
| 1170 |
+
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
|
| 1171 |
+
|
| 1172 |
+
## <span style="color:#466f00;">Safety:</span>
|
| 1173 |
+
|
| 1174 |
+
Field | Response
|
| 1175 |
+
---------------------------------------------------|----------------------------------
|
| 1176 |
+
Model Application(s) | Speech to Text Transcription
|
| 1177 |
+
Describe the life critical impact | None
|
| 1178 |
+
Use Case Restrictions | Abide by [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) License
|
| 1179 |
+
Model and dataset restrictions | The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.
|
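
The Explainability table in this change lists Word Error Rate as the performance metric. As an illustration only, the metric can be sketched as a word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this sketch; the card does not say how its scores were computed, and real evaluations usually normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; tokenization is a plain whitespace
    split with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, the value can exceed 1.0 when the hypothesis contains many inserted words.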