---
language: es
datasets:
- projecte-aina/cv17_es_other_automatically_verified
tags:
- audio
- automatic-speech-recognition
- spanish
- parakeet-rnnt-1.1b
- NeMo
- projecte-aina
- barcelona-supercomputing-center
- bsc
license: apache-2.0
model-index:
- name: parakeet-rnnt-1.1b_cv17_es_ep18_1270h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Test)
      type: mozilla-foundation/common_voice_17_0
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 3.93
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Dev)
      type: mozilla-foundation/common_voice_17_0
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 3.55
library_name: nemo
---

# parakeet-rnnt-1.1b_cv17_es_ep18_1270h

## Table of Contents

- [Paper](#paper)
- [Model Summary](#model-summary)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

## Paper

**PDF:** [Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0](https://dspace.ut.ee/server/api/core/bitstreams/918ec35c-a079-4258-b20d-07275ea28ae4/content)

## Model Summary

"parakeet-rnnt-1.1b_cv17_es_ep18_1270h" is an acoustic model based on ["nvidia/parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b), suitable for Automatic Speech Recognition in Spanish.

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Spanish. It is intended to transcribe Spanish audio files to plain text without punctuation.

## How to Get Started with the Model

For an updated and functional version of this code, please see NVIDIA's official [repository](https://huggingface.co/nvidia/parakeet-rnnt-1.1b).

### Installation

To use this model, install the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo).

Create a virtual environment:

```bash
python -m venv /path/to/venv
```

Activate the environment:

```bash
source /path/to/venv/bin/activate
```

Install the modules:

```bash
# Shell assignments must not have spaces around "="
BRANCH='main'
# Quote the URL so the shell does not try to expand "[all]"
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[all]"
```

### For Inference

To transcribe Spanish audio with this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned checkpoint by its model name
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="projecte-aina/parakeet-rnnt-1.1b_cv17_es_ep18_1270h")

# Transcribe a list of audio file paths
output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])

# Print the transcription of the first file
print(output[0].text)
```

## Training Details

### Training data

The model was trained on the ["cv17_es_other_automatically_verified"](https://huggingface.co/datasets/projecte-aina/cv17_es_other_automatically_verified) dataset (784 hours and 50 minutes) combined with around 485 hours of Spanish data taken from the "validated" split of [Mozilla Common Voice
17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0).

### Training procedure

This model is the result of fine-tuning ["parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) by following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb).

### Training Hyperparameters

* language: Spanish
* hours of training audio: 1270
* learning rate: 2e-4
* devices=4
* num_nodes=8
* batch_size=8
* accelerator=accelerator
* strategy="ddp"
* max_epochs=50
* enable_checkpointing=True
* logger=False
* log_every_n_steps=100
* check_val_every_n_epoch=1
* precision='bf16-mixed'
* callbacks=[checkpoint_callback]

## Citation

If this model contributes to your research, please cite the work:

```bibtex
@inproceedings{mena2025automatic,
  title={Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0},
  author={Mena, Carlos Daniel Hern{\'a}ndez and Scalvini, Barbara and {\'\i} L{\'a}g, D{\'a}vid},
  booktitle={Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)},
  pages={58--63},
  year={2025}
}
```

## Additional Information

### Author

The fine-tuning process was performed in November 2024 at the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena).

### Contact

For further information, please send an email to .

### Copyright

Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
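
### Evaluation sketch

The model metadata reports word error rate (WER) on the Common Voice 17.0 Spanish test and dev splits. As an illustration only, the following minimal sketch shows how WER is conventionally computed: the word-level edit distance between a reference transcript and a model transcript, divided by the number of reference words. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for this example; this is not the evaluation script used to produce the reported figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, used only to exercise the function.
    wer = word_error_rate("el gato come pescado", "el gato come pan")
    print(f"WER: {100 * wer:.2f}%")
```

Because the model produces plain text without punctuation, reference transcripts would normally have their punctuation removed before such a comparison; that step is left out of the sketch.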