---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/an-be-kalan-bench
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/soloni-114m-tdt-ctc-v2
model-index:
- name: soloni-be-kalan-v0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: An be kalan Children's Reading Benchmark
      type: RobotsMali/an-be-kalan-bench
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 22.0
    - name: Test CER
      type: cer
      value: 8.0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---
# Soloni Be Kalan (TDT-CTC 114M)
<style>
img {
display: inline;
}
</style>
`soloni-be-kalan-v0` is a domain-specific fine-tuned version of [`RobotsMali/soloni-114m-tdt-ctc-v2`](https://huggingface.co/RobotsMali/soloni-114m-tdt-ctc-v2). This model was adapted specifically for Bambara educational materials and child speech applications. The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.
## **🚨 Important Note**
This model, along with its associated resources, is part of an **ongoing research effort** by the RobotsMali AI4D Lab. Users should be aware that:
* **Early Childhood Performance Gap:** The model reduces WER on early childhood speech (<10 years) from 56% to 29%, but physiological features unique to young children (unformed acoustic profiles, erratic speech rates) still make this cohort an out-of-domain challenge compared to older cohorts.
* **Structural Dependencies:** The model performs exceptionally well on fluid, sequential storytelling text (as low as 7% WER), but faces structural limitations on short, highly repetitive, sparse token sequences (e.g., *Kuloriw* or *Jate*), where language model prior biases dominate.
## NVIDIA NeMo: Training
To fine-tune or run inference with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it alongside a compatible PyTorch environment.
```bash
pip install "nemo-toolkit[asr]"
```
## How to Use This Model
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-be-kalan-v0")
```
### Transcribe Audio
```python
# Accepts 16 kHz mono-channel WAV files. Audio at other sample rates is
# resampled by the preprocessor, so no manual resampling is needed.
asr_model.transcribe(['sample_child_reading.wav'])
```
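The value returned by `transcribe` differs between NeMo releases: some return plain strings, others return hypothesis objects with a `.text` attribute, and some older transducer models return a `(best_hyps, all_hyps)` tuple. The helper below is a minimal sketch for normalizing these cases to a list of strings. The function names `to_text` and `transcripts` are hypothetical and not part of NeMo.

```python
def to_text(result):
    """Return the transcript string from one transcribe() result item."""
    # Hypothesis-like objects expose `.text`; plain strings are returned as-is.
    return result.text if hasattr(result, "text") else str(result)


def transcripts(output):
    """Normalize the value returned by transcribe() to a list of strings."""
    # Assumption: a tuple means (best_hyps, all_hyps); keep the best hypotheses.
    if isinstance(output, tuple):
        output = output[0]
    return [to_text(item) for item in output]
```

With a loaded model, this would be called as `transcripts(asr_model.transcribe(['sample_child_reading.wav']))`.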
If you encounter a `RuntimeError: CUDA error: invalid argument` due to GPU compatibility with CUDA Graphs in the TDT decoder, disable it in your configuration before transcribing:
```python
decoding_cfg = asr_model.cfg.decoding
decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
```
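The results reported in this card use the CTC branch, while the TDT branch is the default. NeMo hybrid models accept a `decoder_type` argument in `change_decoding_strategy` to switch between the two. The wrapper below is a small sketch of that call; `select_decoder` is a hypothetical helper name, and the exact behavior may vary with your NeMo version.

```python
def select_decoder(asr_model, branch="ctc"):
    """Switch a hybrid TDT-CTC model between its two decoding branches."""
    if branch not in ("ctc", "rnnt"):
        raise ValueError(f"unknown branch: {branch!r}")
    # "rnnt" selects the transducer (TDT) branch, "ctc" the auxiliary CTC head.
    asr_model.change_decoding_strategy(decoder_type=branch)
    return asr_model
```

For example, `select_decoder(asr_model, "ctc")` before calling `transcribe` should decode with the CTC head, and `select_decoder(asr_model, "rnnt")` should switch back to TDT.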
## Model Architecture
This model utilizes the **Hybrid FastConformer-TDT-CTC** architecture with 114 million parameters. FastConformer optimizes the standard Conformer model with 8x depthwise-separable convolutional downsampling. It features two independent but jointly trained decoders: an auto-regressive TDT decoder (default branch) and a convolutional decoder optimized via CTC loss.
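To make the 8x downsampling concrete, the sketch below estimates the encoder frame duration and frame count for a clip. It assumes a 10 ms feature hop, which is common in FastConformer configurations but is not stated in this card; the exact frame count also depends on padding, so treat the result as approximate.

```python
import math


def encoder_frame_ms(hop_ms=10.0, subsampling=8):
    """Duration covered by one encoder output frame, in milliseconds."""
    # Assumption: hop_ms=10.0 is the feature hop; subsampling=8 is from the card.
    return hop_ms * subsampling


def approx_encoder_frames(duration_s, hop_ms=10.0, subsampling=8):
    """Approximate number of encoder frames for a clip of the given length."""
    return math.ceil(duration_s * 1000.0 / encoder_frame_ms(hop_ms, subsampling))


print(encoder_frame_ms())          # 80.0
print(approx_encoder_frames(4.0))  # 50
```

Under these assumptions, both decoders operate on roughly 12.5 frames per second of audio.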
## Training & Fine-Tuning Configurations
Fine-tuning was executed under strict low-resource optimization constraints:
* **Optimization Framework:** Regularized using an **Early Stopping mechanism** with a **15-epoch patience window**, monitored on a validation set matching the benchmark distribution.
* **Convergence Behavior:** With high acoustic data density acting as a natural regularizer, training concluded at **epoch 20**, avoiding the vocabulary collapse and lexical overfitting typical of small, pristine speech corpora.
* **Augmentation Strategy:** This configuration explicitly omitted synthetic spectral masking (`SpecAugment=None`), demonstrating that physical voice variance from natural human speakers acts as a better regularizer than artificial noise injection in this specific domain.
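The patience-based early stopping described above can be sketched in plain Python as follows. This is an illustration of the general mechanism, not the training code used for this model: the `EarlyStopper` class is hypothetical, and the monitored metric (assumed here to be a validation error where lower is better) is not specified in this card. In a NeMo or PyTorch Lightning setup, the equivalent is typically an early stopping callback configured with a `patience` value.

```python
class EarlyStopper:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")  # assumption: lower metric is better (e.g. WER)
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch's validation metric; return True to stop."""
        if val_metric < self.best - self.min_delta:
            # Improvement: remember the new best and reset the counter.
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A training loop would call `stopper.step(val_wer)` once per epoch and break when it returns `True`.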
## Dataset
The model was fine-tuned on the combined **Main + Duplicate** expanded subset (totaling **45.6 hours**) of the [RobotsMali/an-be-kalan-bench](https://huggingface.co/datasets/RobotsMali/an-be-kalan-bench) dataset.
* **Main Split (1.6h):** Clean readings of 22 GAIFE project books recorded by 8 unique speakers.
* **Duplicate Split (44h):** A highly dense, multi-speaker redundant corpus featuring natural human speech variations (pitch, accent, child speech dynamics) reading the identical source literature.
## Performance
Performance is disaggregated below across overall results, specific age cohorts, and distinctive book structures using Word Error Rate (WER) and Character Error Rate (CER).
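For reference, WER and CER are both edit-distance metrics: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below shows one common corpus-level formulation. It is not the evaluation script used for this card, and details such as text normalization and whether spaces count as characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit='word') or CER (unit='char'), in percent."""
    # Assumption: CER counts spaces as characters.
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * errors / total


print(error_rate(["a b c d"], ["a x c"], unit="word"))  # 50.0
```

In the example, one substitution and one deletion against a four-word reference give a WER of 50%.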
### Overall Evaluation
| Model | Decoding Branch | WER (%) ↓ | CER (%) ↓ | Status |
| --- | --- | --- | --- | --- |
| **soloni-be-kalan** | CTC | **22.0%** | **8.0%** | *Newly deployed model in An be Kalan app* |
| *soloni-114m-tdt-ctc-v2 (Base)* | CTC | *42.0%* | *15.0%* | *Pre-trained Baseline Reference* |
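The absolute figures in the table can also be read as relative error reductions over the baseline. The short calculation below derives them from the reported numbers; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, finetuned):
    """Relative error reduction over the baseline, in percent."""
    return 100.0 * (baseline - finetuned) / baseline


wer_gain = relative_reduction(42.0, 22.0)
cer_gain = relative_reduction(15.0, 8.0)
print(f"relative WER reduction: {wer_gain:.1f}%")  # 47.6%
print(f"relative CER reduction: {cer_gain:.1f}%")  # 46.7%
```

Both metrics improve by a little under half relative to the base model.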
### Demographic Cohort Breakdown
| Age Cohort | Utterance Count | Baseline WER (%) | Fine-Tuned WER (%) | Key Insights |
| --- | --- | --- | --- | --- |
| **Early Childhood (<10 yrs)** | 93 | 56.0% | **29.0%** | Remains the single largest acoustic error cluster.|
| **Target Cohort (10-15 yrs)** | 527 | — | **22.0%** | Majority representation; stable acoustic profiles.|
## License
This model is released under the **CC-BY-4.0** license.