```yaml
language:
- bm
library_name: nemo
datasets:
- RobotsMali/kunkado
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/soloba-tdt-0.6b-v0.5
model-index:
- name: soloba-tdt-0.6b-v1.5
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kunkado
      type: RobotsMali/kunkado
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 39.7866505648225
    - name: Test CER
      type: cer
      value: 23.216155838453484
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Nyana Eval
      type: RobotsMali/nyana-eval
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 39.813084
    - name: Test CER
      type: cer
      value: 22.908453
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
```
# Soloba-TDT-600M Series
`soloba-tdt-0.6b-v1.5` is a fine-tuned version of [`RobotsMali/soloba-tdt-0.6b-v0.5`](https://huggingface.co/RobotsMali/soloba-tdt-0.6b-v0.5) on RobotsMali/kunkado. It was fine-tuned using **NVIDIA NeMo**. The model does not consistently produce capitalization and punctuation, and it cannot produce acoustic event tags like those found in Kunkado's transcriptions.
## **🚨 Important Note**
This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:
- **The model may not generalize well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**
## NVIDIA NeMo: Training
To fine-tune or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after installing the latest version of PyTorch.
```bash
# Quoting prevents shells such as zsh from treating the brackets as a glob
pip install "nemo_toolkit[asr]"
```
## How to Use This Model
Note that this model has been released primarily for research purposes.
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-tdt-0.6b-v1.5")
```
### Transcribe Audio
```python
# Assumes asr_model was loaded as shown above and a test file named sample_audio.wav exists
asr_model.eval()  # switch to inference mode
outputs = asr_model.transcribe(['sample_audio.wav'])
print(outputs[0].text)  # transcription string (nemo>=2.3)
```
### Input
This model accepts any **mono-channel audio (WAV files)** as input and resamples it to a *16 kHz sample rate* before performing the forward pass.
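Before transcribing, it can help to confirm that a file is a mono WAV. The sketch below uses only Python's standard `wave` module (which reads PCM WAV files); the helper name `inspect_wav` is hypothetical and not part of NeMo. It reports the sample rate rather than rejecting non-16 kHz files, since resampling to 16 kHz is described above as happening before the forward pass.

```python
import wave
from typing import NamedTuple


class WavInfo(NamedTuple):
    channels: int
    sample_rate: int
    duration_s: float


def inspect_wav(path: str) -> WavInfo:
    """Hypothetical helper: read WAV header fields with the stdlib `wave` module."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        duration_s = wf.getnframes() / sample_rate
    if channels != 1:
        # The model expects mono-channel input; downmix before transcribing.
        raise ValueError(f"{path}: expected mono audio, got {channels} channels")
    return WavInfo(channels, sample_rate, duration_s)
```

For example, `inspect_wav("sample_audio.wav")` returns the channel count, sample rate, and duration in seconds, or raises `ValueError` for multi-channel files.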
### Output
For each speech sample, this model returns a hypothesis object whose `text` attribute contains the transcription string (nemo>=2.3).
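When transcribing several files, a small helper can pull out the strings. This is a sketch, not a NeMo utility: `hypotheses_to_text` and `FakeHypothesis` are hypothetical names, and the plain-string branch is only a defensive fallback, since the behavior documented here is the hypothesis object with a `text` attribute for nemo>=2.3.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


def hypotheses_to_text(outputs: Iterable[Any]) -> List[str]:
    """Return the transcription string for each item returned by transcribe().

    Assumption: each item is either a hypothesis-like object exposing `.text`
    or, as a fallback, already a plain string.
    """
    return [item if isinstance(item, str) else item.text for item in outputs]


@dataclass
class FakeHypothesis:
    """Stand-in object used only to demonstrate the helper without loading NeMo."""
    text: str


print(hypotheses_to_text([FakeHypothesis("first transcript"), "second transcript"]))
```

With a loaded model, the same helper would be called as `hypotheses_to_text(asr_model.transcribe(["sample_audio.wav"]))`.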
## Model Architecture
This model uses a FastConformer encoder and an autoregressive Token-and-Duration Transducer (TDT) decoder, a variant of RNN-T that jointly learns to predict a token and its duration. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You can find more details about FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
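To illustrate what predicting "a token and its duration" means at decoding time, here is a toy greedy TDT loop in plain Python. It is a simplified sketch, not NeMo's implementation: the `joint` callable stands in for the prediction and joint networks, the duration set `(0, 1, 2, 3, 4)` and blank id `0` are assumptions, and real decoders work on batched tensors. The key idea shown is that a predicted duration lets the decoder jump over encoder frames instead of visiting every one.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BLANK_ID = 0  # assumption: blank token id in this toy vocabulary

# joint(t, tokens_so_far) -> (token_scores, duration_scores)
JointFn = Callable[[int, List[int]], Tuple[Sequence[float], Sequence[float]]]


def argmax(scores: Sequence[float]) -> int:
    return max(range(len(scores)), key=lambda i: scores[i])


def tdt_greedy_decode(
    num_frames: int,
    joint: JointFn,
    durations: Sequence[int] = (0, 1, 2, 3, 4),
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Toy greedy TDT decoding: each step picks a token AND how many frames to skip."""
    tokens: List[int] = []
    t = 0
    emitted_at_t = 0
    while t < num_frames:
        token_scores, duration_scores = joint(t, tokens)
        token = argmax(token_scores)
        skip = durations[argmax(duration_scores)]
        if token != BLANK_ID:
            tokens.append(token)  # a real prediction network would update its state here
            emitted_at_t += 1
        # A blank with duration 0 would never advance, so force one frame;
        # the same guard applies after too many symbols on a single frame.
        if skip == 0 and (token == BLANK_ID or emitted_at_t >= max_symbols_per_frame):
            skip = 1
        if skip > 0:
            emitted_at_t = 0
        t += skip
    return tokens


def make_scripted_joint(script: Dict[Tuple[int, int], Tuple[int, int]],
                        vocab_size: int = 10, num_durations: int = 5) -> JointFn:
    """Hypothetical stand-in joint: (frame, #tokens) -> (token id, duration index)."""
    def joint(t: int, tokens: List[int]):
        token_id, duration_idx = script[(t, len(tokens))]
        token_scores = [0.0] * vocab_size
        token_scores[token_id] = 1.0
        duration_scores = [0.0] * num_durations
        duration_scores[duration_idx] = 1.0
        return token_scores, duration_scores
    return joint


# Frame 0: emit 5 (stay), emit 7 (skip 2); frame 2: blank (skip 3); frame 5: emit 9.
script = {(0, 0): (5, 0), (0, 1): (7, 2), (2, 2): (0, 3), (5, 2): (9, 1)}
print(tdt_greedy_decode(6, make_scripted_joint(script)))  # [5, 7, 9]
```

In this trace, frames 1, 3, and 4 are never evaluated, which is where a duration-predicting transducer can save work compared with a frame-by-frame RNN-T loop.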
## Training
The NeMo toolkit was used to fine-tune this model for **40,000 steps**, starting from the `RobotsMali/soloba-tdt-0.6b-v0.5` checkpoint, with a batch size of 32. The fine-tuning code and configurations can be found at [RobotsMali-AI/bambara-asr](https://github.com/RobotsMali-AI/bambara-asr/).
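As a rough sense of scale, the number of training examples processed is steps × batch size. The sketch below assumes 32 is the effective (global) batch size with no gradient accumulation, which the card does not state; if 32 were a per-GPU batch size, the total would be multiplied by the number of GPUs. The utterance count used for an epoch estimate is a placeholder you would take from your own copy of the training split.

```python
def samples_seen(steps: int, batch_size: int, grad_accum: int = 1) -> int:
    """Number of training examples consumed: steps x batch size x accumulation."""
    return steps * batch_size * grad_accum


def approx_epochs(steps: int, batch_size: int, num_train_utterances: int,
                  grad_accum: int = 1) -> float:
    """Approximate passes over a training set of `num_train_utterances` examples."""
    return samples_seen(steps, batch_size, grad_accum) / num_train_utterances


print(samples_seen(40_000, 32))  # 1280000
```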
The tokenizer for this model was trained on the text transcripts of the RobotsMali/kunkado training split using this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
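The linked NeMo tokenizer script can read transcripts from NeMo-style manifests: JSON Lines files where each line describes one utterance with `audio_filepath`, `duration`, and `text` fields. Below is a minimal standard-library sketch for writing such a manifest; `write_manifest` is a hypothetical helper, the paths and transcripts are placeholders, and how RobotsMali actually built their manifests is not described here.

```python
import json
from typing import Iterable, Tuple


def write_manifest(path: str, rows: Iterable[Tuple[str, float, str]]) -> int:
    """Write (audio_filepath, duration_seconds, text) rows as NeMo-style JSON Lines."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in rows:
            record = {"audio_filepath": audio_filepath, "duration": duration, "text": text}
            # ensure_ascii=False keeps Bambara letters such as ɛ and ɔ readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```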
## Dataset
This model was fine-tuned on the human-reviewed subset of the [kunkado](https://huggingface.co/datasets/RobotsMali/kunkado) dataset; this subset consists of **~40 hours of transcribed Bambara speech data**. Prior to training, the text was normalized with the [bambara-normalizer](https://pypi.org/project/bambara-normalizer/): numbers were normalized, and punctuation and tags were removed.
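The exact rules live in the bambara-normalizer package; the sketch below neither uses nor reproduces its API. It is a hypothetical, simplified stand-in for two of the steps named above, assuming tags are written in square brackets (for example `[laughter]`) and treating Unicode punctuation as removable except the apostrophe, which is kept on the assumption that it marks elision (as in `y'a`). Number normalization is not covered. Applying one consistent normalization to both references and hypotheses matters when comparing against the reported WER/CER.

```python
import re
import unicodedata

TAG_PATTERN = re.compile(r"\[[^\]]*\]")  # assumption: tags look like [laughter]
KEEP = {"'"}  # assumption: keep apostrophes used in elisions


def simple_normalize(text: str) -> str:
    """Hypothetical simplified normalizer: drop bracketed tags and punctuation."""
    text = TAG_PATTERN.sub(" ", text)  # remove tags before their brackets are stripped
    text = "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    return " ".join(text.split())  # collapse whitespace left behind by removals
```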
## Performance
We report the Word Error Rate (WER) and Character Error Rate (CER) for this model:
| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|------------|----------|-----------|-----------|
| Kunkado | TDT | 39.78 | 23.21 |
| Nyana Eval | TDT | 39.81 | 22.90 |
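WER and CER are edit-distance rates: (substitutions + deletions + insertions) divided by the number of reference words or characters, reported above as percentages. A minimal, dependency-free sketch follows. It pools edits and reference lengths over a whole set of (reference, hypothesis) pairs, which is one common way to aggregate; the evaluation script behind the table is not described here, and toolkits differ on details such as whether spaces count as characters (they do in this sketch).

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if equal)
            )
        prev = curr
    return prev[-1]


def corpus_error_rate(pairs: Iterable[Tuple[str, str]], unit: str = "word") -> float:
    """Percent error over (reference, hypothesis) pairs, pooled over the corpus."""
    edits = 0
    ref_len = 0
    for ref, hyp in pairs:
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(r, h)
        ref_len += len(r)
    return 100.0 * edits / ref_len
```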
## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub for help or contributions.