---
license: mit
datasets:
- Isma/alffa_wolof
language:
- wo
metrics:
- wer
base_model:
- facebook/mms-1b
pipeline_tag: automatic-speech-recognition
---
# wav2vec2-large-mms-1b-wolof

This model is a fine-tuned version of [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) on the **Isma/alffa_wolof** dataset. It is designed to perform automatic speech recognition (ASR) in the Wolof language.

## Model description

This model is based on the Wav2Vec 2.0 architecture. The base model, **facebook/mms-1b-all**, was trained on a multilingual corpus for general-purpose ASR. This version has been fine-tuned on the **Isma/alffa_wolof** dataset, which contains audio recordings in the Wolof language.

## Training and evaluation data

The model was trained on the **Isma/alffa_wolof** dataset, which contains audio samples in the Wolof language. Fine-tuning on this data adapts the model to the phonetic characteristics of Wolof speech. The inference examples below load their test samples from a different dataset, `perrynelson/waxal-wolof`.

## Inference manually

```python
# In a notebook, install the dependency first:
# !pip install datasets

# Load test dataset
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
print(dataset)

# Play sample 322 in a notebook. IPython's Audio is the player widget,
# not the datasets.Audio feature type.
from IPython.display import Audio, display

display(Audio(dataset['train'][322]['audio']['array'], rate=16000))

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use half precision on GPU only; fall back to float32 on CPU
dtype = torch.float16 if device.type == "cuda" else torch.float32

# Load the model and move it to the selected device
model = Wav2Vec2ForCTC.from_pretrained(
    model_id,
    target_lang="wol",
    torch_dtype=dtype,
).to(device)

processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")

# Process the audio
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

# Move inputs to the model's device and match its dtype
input_values = input_dict.input_values.to(device, dtype=dtype)

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```
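
The "Decode predictions" step above takes the per-frame argmax and hands it to `processor.decode`, which applies the usual CTC greedy rule: merge consecutive repeats, then drop the blank token. The sketch below illustrates that rule in plain Python. The function name `ctc_greedy_collapse` and the toy vocabulary are hypothetical; the real tokenizer also maps its word-delimiter token to spaces, which is omitted here.

```python
def ctc_greedy_collapse(ids, blank_id):
    """Collapse a frame-level CTC argmax path into a label sequence."""
    out = []
    prev = None
    for i in ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank token
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


# Toy vocabulary (hypothetical; the real one ships with the model's tokenizer)
vocab = {0: "<pad>", 1: "n", 2: "a", 3: "g"}

# One id per audio frame; 0 plays the role of the CTC blank
frames = [1, 1, 0, 2, 2, 0, 2, 3, 3, 0]

decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frames, blank_id=0))
print(decoded)
```

The blank between the two runs of `2` is what lets a doubled letter survive the merge step.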

## Inference with pipeline

```python
from transformers import pipeline
import torch

# Model ID
model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Determine device (use GPU if available, otherwise fallback to CPU)
device = 0 if torch.cuda.is_available() else -1

# Use half precision (float16) for inference if GPU is available
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Set up the pipeline for automatic speech recognition.
# The tokenizer and feature extractor are loaded from the model repo automatically.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    device=device,  # Specify the device (GPU if available, otherwise CPU)
    torch_dtype=torch_dtype,  # Set the precision (float16 for half precision, float32 otherwise)
    framework="pt"  # Use PyTorch as the framework
)

# Input audio processing (reuses `dataset` loaded in the manual inference example)
audio_array = dataset['train'][322]["audio"]["array"]  # Fetching an audio sample

# Run inference
result = pipe(audio_array)

# Prediction
print("Prediction:")
print(result['text'])

# Reference (for comparison)
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```

---

## Free memory

```python
import gc
import torch
import psutil

# Collect unreferenced Python objects first so their GPU tensors can be released.
# Delete references you no longer need beforehand (e.g. `del model`).
gc.collect()

# Return cached GPU memory to the driver - only needed if you use a GPU
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Clears the CUDA caching allocator
    torch.cuda.reset_peak_memory_stats()  # Resets peak memory statistics

# Optionally, check memory status after clearing
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")
```

---

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 20
- mixed_precision_training: Native AMP
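
To make the `lr_scheduler_type: linear` and `lr_scheduler_warmup_steps: 500` settings concrete, here is a minimal sketch of a linear schedule with warmup: the learning rate rises linearly from 0 to the peak over the warmup steps, then decays linearly to 0. The function name `linear_schedule_lr` is hypothetical, and the total of 17,500 steps is an assumption inferred from the results table (875 steps per epoch times 20 epochs), not a value stated by the author.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=17_500):
    """Learning rate at a given optimizer step for linear warmup + linear decay.

    total_steps is an assumption: 875 steps/epoch * 20 epochs.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 250, 500, 9_000, 17_500):
    print(s, linear_schedule_lr(s))
```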

### Training results

| Training Loss | Epoch | Step  | Validation Loss | WER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.3793        | 14.0  | 12250 | 0.1517          | 0.1888 |
| 0.3709        | 15.0  | 13125 | 0.1512          | 0.1882 |
| 0.3702        | 16.0  | 14000 | 0.1499          | 0.1858 |
| 0.367         | 17.0  | 14875 | 0.1492          | 0.1848 |
| 0.3656        | 18.0  | 15750 | 0.1493          | 0.1842 |
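
The last column of the table is the word error rate: the word-level edit distance (substitutions, deletions, insertions) between the prediction and the reference, divided by the number of reference words. A value of 0.1842 therefore corresponds to roughly 18 word errors per 100 reference words. The sketch below shows the standard definition in plain Python; the function name `wer` and the placeholder sentences are illustrative, and the library actually used for evaluation is not stated in this card.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# Placeholder tokens, not real transcriptions
print(wer("a b c d", "a x c"))
```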


### Framework versions

- Transformers 4.41.2
- PyTorch 2.4.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1

---

## Intended uses & limitations

- **Intended uses**: This model is intended for speech-to-text tasks in Wolof. It can be used to transcribe audio recordings in Wolof into written text.
- **Limitations**: This model performs best with clean audio and may struggle with noisy or low-quality recordings. It is designed specifically for the Wolof language and may not work well with other languages.


### Author Information

- **Author**: Bilal FAYE