---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/afvoices
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v2
model-index:
- name: soloba-tdt-0.6b-v0.5
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Next Voices
      type: RobotsMali/afvoices
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 29.754791835954303
    - name: Test CER
      type: cer
      value: 13.498655089424755
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Nyana Eval
      type: RobotsMali/nyana-eval
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 42.4299
    - name: Test CER
      type: cer
      value: 23.3465
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Soloba-TDT-600M Series

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--TDT-blue#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-0.6B-green#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-bm-orange#model-badge)](#dataset)

`soloba-tdt-0.6b-v0.5` is a fine-tuned version of [`nvidia/parakeet-tdt-0.6b-v2`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) on the African Next Voices (ANV) dataset. The model does not consistently produce capitalization and punctuation, and it cannot produce acoustic event tags like those found in the ANV transcriptions. It was fine-tuned using **NVIDIA NeMo**.

This model is the only one of the v0 series that was not trained on RobotsMali/bam-asr-early or any derivative of Jeli-ASR, hence its particular naming (v0.5).

## **🚨 Important Note**

This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions.
Users should be aware that:

- **The model may not generalize very well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**

## NVIDIA NeMo: Training

To fine-tune or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install "nemo-toolkit[asr]"
```

## How to Use This Model

Note that this model has been released primarily for research purposes.

### Load Model with NeMo

```python
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and instantiate the model
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-tdt-0.6b-v0.5")
```

### Transcribe Audio

```python
# Switch to inference mode
asr_model.eval()

# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```

### Input

This model accepts **mono-channel audio (WAV files)** as input and resamples it to a *16 kHz sample rate* before performing the forward pass.

### Output

For each speech sample, this model returns a hypothesis object whose `text` attribute contains the transcription string (nemo>=2.3).

## Model Architecture

This model uses a FastConformer encoder and an autoregressive Token-and-Duration Transducer (TDT) decoder, a variant of RNN-T that jointly learns to predict a token and its duration. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

## Training

The NeMo toolkit was used to fine-tune this model for **82,628 steps**, starting from the `nvidia/parakeet-tdt-0.6b-v2` model. The fine-tuning code and configurations can be found at [RobotsMali-AI/bambara-asr](https://github.com/RobotsMali-AI/bambara-asr/).
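
NeMo ASR training reads its data from JSON-lines manifest files, where each line describes one utterance with the keys `audio_filepath`, `duration` (in seconds) and `text`. If you want to fine-tune on your own recordings, the following is a minimal sketch of how such a manifest could be written with the Python standard library only. The helper names `wav_duration_seconds` and `write_manifest` and the file names in the usage comment are hypothetical, and the sketch assumes uncompressed PCM WAV files; it is not taken from the RobotsMali training code.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write one JSON object per line for each (wav_path, transcript) pair.

    Hypothetical helper: the three keys follow the usual NeMo ASR manifest
    convention; any extra fields a given dataset may carry are not handled.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, transcript in samples:
            entry = {
                "audio_filepath": str(Path(wav_path).resolve()),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "text": transcript,
            }
            # ensure_ascii=False keeps Bambara characters (ɛ, ɔ, ɲ, ŋ) readable
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Example usage (hypothetical file names):
# write_manifest([("clip_001.wav", "i ni ce")], "train_manifest.json")
```

Storing the duration in the manifest lets the data loader filter or bucket utterances by length without opening every audio file.
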
The tokenizer for this model was trained on the text transcripts of the train set of RobotsMali/afvoices using this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

## Dataset

This model was fine-tuned on a 100-hour pre-completion subset of the [African Next Voices](https://huggingface.co/datasets/RobotsMali/afvoices) dataset. You can reconstitute that subset with these [manifest files](https://github.com/RobotsMali-AI/bambara-asr/afvoices/pre-manifests).

## Performance

We report the Word Error Rate (WER) and Character Error Rate (CER) for this model on the test split of each benchmark:

| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|---------------|----------|-----------------|-----------------|
| African Next Voices (afvoices) | TDT | 29.75 | 13.50 |
| Nyana Eval | TDT | 42.43 | 23.35 |

## License

This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub for help or contributions.