---
language: es
datasets:
- projecte-aina/cv17_es_other_automatically_verified
tags:
- audio
- automatic-speech-recognition
- spanish
- parakeet-rnnt-1.1b
- NeMo
- projecte-aina
- barcelona-supercomputing-center
- bsc
license: apache-2.0
model-index:
- name: parakeet-rnnt-1.1b_cv17_es_ep18_1270h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Test)
      type: mozilla-foundation/common_voice_17_0
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 3.93
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Dev)
      type: mozilla-foundation/common_voice_17_0
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 3.55
library_name: nemo
---
# parakeet-rnnt-1.1b_cv17_es_ep18_1270h

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Paper](#paper)
- [Model Summary](#model-summary)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>


## Paper

**PDF:** [Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0](https://dspace.ut.ee/server/api/core/bitstreams/918ec35c-a079-4258-b20d-07275ea28ae4/content)

## Model Summary

"parakeet-rnnt-1.1b_cv17_es_ep18_1270h" is an acoustic model for Automatic Speech Recognition in Spanish, based on ["nvidia/parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b).

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Spanish. The model is intended to transcribe audio files in Spanish to plain text without punctuation.
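
Since the output is plain text without punctuation, reference transcripts usually need a comparable normalization before they are compared with the model's hypotheses. The following is a minimal sketch of such a step; the function name `normalize_transcript` is hypothetical, and lowercasing is an assumption that this card does not state.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation and symbols, collapse whitespace (illustrative only)."""
    # Compose accented characters so that "ó" is a single letter code point.
    text = unicodedata.normalize("NFC", text).lower()
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Keep letters (including accented ones and ñ), digits and whitespace;
        # everything else (punctuation, symbols) becomes a space.
        if category.startswith(("L", "N")) or ch.isspace():
            kept.append(ch)
        else:
            kept.append(" ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


print(normalize_transcript("¿Cómo estás, Señor Muñoz?"))
```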

## How to Get Started with the Model

For an up-to-date, working version of this code, please see NVIDIA's official [repository](https://huggingface.co/nvidia/parakeet-rnnt-1.1b).

### Installation

To use this model, install the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo):

Create a virtual environment:
```bash
python -m venv /path/to/venv
```
Activate the environment:
```bash
source /path/to/venv/bin/activate
```
Install the modules:
```bash
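# Shell variable assignments must not have spaces around "=".
BRANCH='main'
# Quote the URL so the shell does not try to expand "[all]".
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[all]"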
```
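
After installation, a quick way to confirm that the package is visible from the active virtual environment is to ask Python whether the top-level module can be located. This is only a sketch: the helper name `is_installed` is hypothetical, and the check does not verify that every optional NeMo dependency works.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# "nemo" is the top-level package imported in the inference example.
print("nemo importable:", is_installed("nemo"))
```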

### For Inference
In order to transcribe audio in Spanish using this model, you can follow this example:

```python
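import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="projecte-aina/parakeet-rnnt-1.1b_cv17_es_ep18_1270h")

# Transcribe a list of audio files; one result is returned per file.
output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)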
```
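
Before transcribing, it can help to check the format of the input file. The base model's card describes its input as 16 kHz mono WAV audio; assuming the same holds for this fine-tuned model, the sketch below inspects a file with Python's standard `wave` module. The helper names `wav_info` and `looks_like_16k_mono` are hypothetical, and `wave` only reads uncompressed PCM WAV files.

```python
import wave


def wav_info(path: str) -> dict:
    """Read the sample rate, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def looks_like_16k_mono(path: str, expected_rate: int = 16000) -> bool:
    """Check the assumed input format: one channel at `expected_rate` Hz."""
    info = wav_info(path)
    return info["sample_rate"] == expected_rate and info["channels"] == 1
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to `transcribe`.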

## Training Details

### Training data

The model was trained on the ["cv17_es_other_automatically_verified"](https://huggingface.co/datasets/projecte-aina/cv17_es_other_automatically_verified) dataset (784 hours and 50 minutes) combined with around 485 hours of Spanish data taken from the "validated" split of [Mozilla Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0).
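
As a quick sanity check, the two figures above add up to roughly the 1270 hours that appear in the model name. The arithmetic can be sketched as follows; the variable names are illustrative, and the 485-hour figure is approximate in the source.

```python
# Durations as stated in this card, converted to hours.
auto_verified_hours = 784 + 50 / 60   # 784 hours and 50 minutes
validated_hours = 485                 # "around 485 hours" (approximate)

total_hours = auto_verified_hours + validated_hours
auto_verified_share = auto_verified_hours / total_hours

print(f"total: about {round(total_hours)} hours")
print(f"automatically verified share: {auto_verified_share:.1%}")
```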

### Training procedure

This model is the result of fine-tuning ["parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) by following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb).

### Training Hyperparameters

* language: Spanish
* hours of training audio: 1270
* learning rate: 2e-4
* devices=4
* num_nodes=8
* batch_size=8
* accelerator=accelerator
* strategy="ddp"
* max_epochs=50
* enable_checkpointing=True
* logger=False
* log_every_n_steps=100
* check_val_every_n_epoch=1
* precision='bf16-mixed'
* callbacks=[checkpoint_callback]
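
To give a sense of scale, the distributed settings listed above imply a certain number of processes and a global batch size. The sketch below works this out under three assumptions that this card does not state explicitly: `devices` is the number of GPUs per node, `batch_size` is per GPU, and no gradient accumulation is used.

```python
# Values taken from the hyperparameter list.
devices = 4      # assumed: GPUs per node
num_nodes = 8
batch_size = 8   # assumed: samples per GPU per step

# With DDP, one process runs per GPU, so the world size is devices * num_nodes.
world_size = devices * num_nodes

# Assumed: no gradient accumulation, so each optimizer step sees this many samples.
global_batch_size = world_size * batch_size

print(f"world size: {world_size}, global batch size: {global_batch_size}")
```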

## Citation
If this model contributes to your research, please cite the work:

```bibtex
@inproceedings{mena2025automatic,
  title={Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0},
  author={Mena, Carlos Daniel Hern{\'a}ndez and Scalvini, Barbara and {\'\i} L{\'a}g, D{\'a}vid},
  booktitle={Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)},
  pages={58--63},
  year={2025}
}
```

<!--
@misc{mena2024parakeetspanish,
      title={Acoustic Model in Spanish: parakeet-rnnt-1.1b_cv17_es_ep18_1270h.}, 
      author={Hernandez Mena, Carlos Daniel},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/parakeet-rnnt-1.1b_cv17_es_ep18_1270h},
      year={2024}
}
-->

## Additional Information

### Author

The fine-tuning process was performed in November 2024 at the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena).

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.