---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/an-be-kalan-bench
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/soloni-114m-tdt-ctc-v2
model-index:
- name: soloni-be-kalan-v0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: An be kalan Children's Reading Benchmark
      type: RobotsMali/an-be-kalan-bench
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 22.0
    - name: Test CER
      type: cer
      value: 8.0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Soloni Be Kalan (TDT-CTC 114M)

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--TDT-blue#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-114M-green#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-bm-orange#model-badge)](#datasets)

`soloni-be-kalan-v0` is a domain-specific fine-tuned version of [`RobotsMali/soloni-114m-tdt-ctc-v2`](https://huggingface.co/RobotsMali/soloni-114m-tdt-ctc-v2). This model was adapted specifically for Bambara educational materials and child speech applications. The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.

## **🚨 Important Note**

This model, along with its associated resources, is part of an **ongoing research effort** by the RobotsMali AI4D Lab. Users should be aware that:

* **Early Childhood Performance Gap:** While this model reduces the baseline error rate on early childhood speech (<10 years) from 56% to 29% WER, physiological features unique to young children (unformed acoustic profiles, erratic speech rates) continue to present an out-of-domain challenge compared to older cohorts.

* **Structural Dependencies:** The model performs very well on fluid, sequential storytelling text (reaching as low as 7% WER), but faces structural limitations on short, highly repetitive, sparse token sequences (e.g., *Kuloriw* or *Jate*) where language model prior biases dominate.

## NVIDIA NeMo: Training

To fine-tune or run inference with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it alongside a compatible PyTorch environment.

```bash
# Quoting the extra keeps shells such as zsh from expanding the brackets
pip install "nemo-toolkit[asr]"
```

## How to Use This Model

### Load Model with NeMo

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-be-kalan-v0")

```

### Transcribe Audio

```python
# Expects mono-channel WAV files. The model works on 16 kHz audio, but other
# sample rates are resampled automatically, so no manual resampling is needed.
asr_model.transcribe(['sample_child_reading.wav'])

```

If you encounter a `RuntimeError: CUDA error: invalid argument` due to GPU compatibility with CUDA Graphs in the TDT decoder, disable it in your configuration before transcribing:

```python
decoding_cfg = asr_model.cfg.decoding
decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)

```

## Model Architecture

This model utilizes the **Hybrid FastConformer-TDT-CTC** architecture with 114 million parameters. FastConformer optimizes the standard Conformer model with 8x depthwise-separable convolutional downsampling. It features two independent but jointly trained decoders: an auto-regressive TDT decoder (default branch) and a convolutional decoder optimized via CTC loss.
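Because both decoders sit on top of the same encoder, the branch used at inference can be switched on an already loaded model. The sketch below wraps the `change_decoding_strategy(decoder_type=...)` call that NeMo hybrid models expose. The helper name `select_decoder_branch` is hypothetical, and the sketch assumes the loaded model accepts `"rnnt"` (the TDT branch) and `"ctc"` as decoder types, as recent NeMo releases do.

```python
def select_decoder_branch(model, branch: str = "ctc"):
    """Switch a hybrid NeMo ASR model between its TDT ("rnnt") and CTC decoders.

    Hypothetical convenience helper, not part of NeMo itself.
    """
    if branch not in ("rnnt", "ctc"):
        raise ValueError(f"branch must be 'rnnt' or 'ctc', got {branch!r}")
    # Hybrid models use decoder_type to pick the decoder that transcribe() runs.
    model.change_decoding_strategy(decoder_type=branch)
    return model
```

With the model loaded as shown earlier, `select_decoder_branch(asr_model, "ctc")` would route subsequent `transcribe` calls through the CTC head, and `select_decoder_branch(asr_model, "rnnt")` would return to the default TDT branch.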

## Training & Fine-Tuning Configurations

Fine-tuning was executed under strict low-resource optimization constraints:

* **Optimization Framework:** Regularized using an **Early Stopping mechanism** with a **15-epoch patience window** based on a validation set matching the benchmark distribution.

* **Convergence Behavior:** Due to high acoustic data density acting as a natural regularizer, training safely concluded at **epoch 20**, preventing the vocabulary collapse and lexical overfitting typical of small, pristine speech corpora.

* **Augmentation Strategy:** This configuration explicitly omitted synthetic spectral masking (`SpecAugment=None`), indicating that physical voice variance from natural human speakers acts as a more effective regularizer than artificial noise injection in this specific domain.
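The patience rule behind early stopping can be sketched in plain Python. This is an illustration only, not the training code used for this model: the function name `early_stop_epoch`, the use of a lower-is-better validation metric such as WER, and one validation pass per epoch are all assumptions.

```python
from typing import Optional, Sequence


def early_stop_epoch(val_metric: Sequence[float], patience: int = 15) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never does.

    Training stops once `patience` consecutive epochs pass without a new best
    (lowest) validation value. Hypothetical helper for illustration.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_metric, start=1):
        if value < best:
            # New best score: remember it and reset the patience counter.
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

For example, with made-up validation values `[30, 25, 24, 26, 27, 28]` and `patience=3`, the best value is reached at epoch 3 and the function returns 6.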

## Dataset

The model was fine-tuned on the combined **Main + Duplicate** expanded subset (totaling **45.6 hours**) of the [RobotsMali/an-be-kalan-bench](https://huggingface.co/datasets/RobotsMali/an-be-kalan-bench) dataset.

* **Main Split (1.6h):** Clean readings of 22 GAIFE project books recorded by 8 unique speakers.

* **Duplicate Split (44h):** A highly dense, multi-speaker redundant corpus featuring natural human speech variations (pitch, accent, child speech dynamics) reading the identical source literature.

## Performance

Performance is disaggregated below across overall results, specific age cohorts, and distinctive book structures using Word Error Rate (WER) and Character Error Rate (CER).
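Both metrics are edit distances normalized by the reference length: WER counts word-level substitutions, insertions, and deletions, and CER does the same over characters. The following is a minimal sketch of that definition, assuming whitespace tokenization and non-empty references. The function names are illustrative, and this card does not state which scoring tool or text normalization produced the reported numbers.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one utterance; the reference must be non-empty."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one utterance; the reference must be non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)
```

On the toy strings `wer("a b c d", "a x c")` the result is 0.5 (one substitution and one deletion over four reference words). Corpus-level scores are usually obtained by summing edits and reference lengths over all utterances before dividing, not by averaging per-utterance rates.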

### Overall Evaluation

| Model | Decoding Branch | WER (%) ↓ | CER (%) ↓ | Status |
| --- | --- | --- | --- | --- |
| **soloni-be-kalan** | CTC | **22.0%** | **8.0%** | *Newly deployed model in An be Kalan app* |
| *soloni-114m-tdt-ctc-v2 (Base)* | CTC | *42.0%* | *15.0%* | *Pre-trained Baseline Reference* |

### Demographic Cohort Breakdown

| Age Cohort | Utterance Count | Baseline WER (%) | Fine-Tuned WER (%) | Key Insights |
| --- | --- | --- | --- | --- |
| **Early Childhood (<10 yrs)** | 93 | 56.0% | **29.0%** | Remains the single largest acoustic error cluster.|
| **Target Cohort (10-15 yrs)** | 527 | — | **22.0%** | Majority representation; stable acoustic profiles.|
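
The absolute figures in the tables can also be read as relative error reductions against the baseline. The short calculation below uses only numbers reported in this card; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline: float, fine_tuned: float) -> float:
    """Relative error-rate reduction, expressed in percent of the baseline."""
    return 100.0 * (baseline - fine_tuned) / baseline


# Error rates (in %) taken from the tables in this card.
overall_wer = relative_reduction(42.0, 22.0)          # about 47.6
overall_cer = relative_reduction(15.0, 8.0)           # about 46.7
early_childhood_wer = relative_reduction(56.0, 29.0)  # about 48.2

print(f"Overall WER reduction: {overall_wer:.1f}%")
print(f"Overall CER reduction: {overall_cer:.1f}%")
print(f"Early childhood WER reduction: {early_childhood_wer:.1f}%")
```

By this measure the fine-tuned model removes a little under half of the baseline errors, both overall and for the early childhood cohort, even though the latter's absolute WER stays higher.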

## License

This model is released under the **CC-BY-4.0** license.