	---
	license: mit
	datasets:
	- Isma/alffa_wolof
	language:
	- wo
	metrics:
	- wer
	base_model:
	- facebook/mms-1b
	pipeline_tag: automatic-speech-recognition
	---
	# wav2vec2-large-mms-1b-wolof

	This model is a fine-tuned version of [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) on the Isma/alffa_wolof dataset. It is designed to perform automatic speech recognition (ASR) in the Wolof language.

	## Model description

	This model uses the Wav2Vec 2.0 architecture. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose ASR. This version has been fine-tuned specifically on Wolof audio recordings (see "Training and evaluation data").

	## Training and evaluation data

	The model was trained on the Isma/alffa_wolof dataset, which contains audio samples in the Wolof language. This dataset is used to fine-tune the model to improve accuracy on the specific phonetic characteristics of Wolof speech.

	## Manual inference

	```python
	# In a notebook, install the dependencies first:
	# !pip install datasets transformers torch

	import torch
	from datasets import load_dataset
	from IPython.display import Audio, display
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	# Load the test dataset
	dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
	print(dataset)

	# Pick one sample and play it in the notebook (audio is assumed to be 16 kHz)
	sample = dataset["train"][322]
	display(Audio(sample["audio"]["array"], rate=16_000))

	model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Use half precision only on GPU; many CPU kernels do not support float16
	dtype = torch.float16 if device.type == "cuda" else torch.float32

	# Load the model and move it to the selected device
	model = Wav2Vec2ForCTC.from_pretrained(
	    model_id,
	    target_lang="wol",
	    torch_dtype=dtype,
	).to(device)
	model.eval()

	processor = Wav2Vec2Processor.from_pretrained(model_id)
	processor.tokenizer.set_target_lang("wol")

	# Process the audio
	input_dict = processor(
	    sample["audio"]["array"],
	    sampling_rate=16_000,
	    return_tensors="pt",
	    padding=True,
	)

	# Move inputs to the same device and dtype as the model
	input_values = input_dict.input_values.to(device, dtype=dtype)

	# Perform inference without tracking gradients
	with torch.no_grad():
	    logits = model(input_values).logits

	# Decode predictions (greedy: most likely token per frame)
	pred_ids = torch.argmax(logits, dim=-1)[0]

	print("Prediction:")
	print(processor.decode(pred_ids))

	print("\nReference:")
	print(sample["transcription"].lower())
	```
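
The decoding step above takes the most likely token id per audio frame and passes the ids to `processor.decode`. For a CTC model this amounts to collapsing consecutive repeats and dropping the blank token. The sketch below illustrates that idea in plain Python; the toy vocabulary and `blank_id=0` are assumptions made for illustration and are not the model's real vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse consecutive repeated ids, then drop the blank id."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary (id -> character), for illustration only.
toy_vocab = {1: "n", 2: "a", 3: "g"}

# One id per audio frame, as produced by argmax over the logits.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3, 0, 2]

decoded_ids = ctc_greedy_collapse(frame_ids)
text = "".join(toy_vocab[i] for i in decoded_ids)
print(text)
```

A blank between two identical ids keeps both of them, which is how CTC can emit doubled letters.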

	## Inference with pipeline

	```python
	import torch
	from transformers import pipeline

	# Model ID
	model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

	# Determine device (use GPU if available, otherwise fall back to CPU)
	device = 0 if torch.cuda.is_available() else -1

	# Use half precision (float16) for inference only if a GPU is available
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# Set up the pipeline for automatic speech recognition.
	# The tokenizer and feature extractor are loaded from the model repository.
	pipe = pipeline(
	    task="automatic-speech-recognition",
	    model=model_id,
	    device=device,            # GPU index if available, otherwise CPU (-1)
	    torch_dtype=torch_dtype,  # float16 on GPU, float32 on CPU
	    framework="pt",           # Use PyTorch as the framework
	)

	# Reuse `dataset` loaded in the previous section and fetch one audio sample
	sample = dataset["train"][322]
	audio_array = sample["audio"]["array"]

	# Run inference
	result = pipe(audio_array)

	# Prediction
	print("Prediction:")
	print(result["text"])

	# Reference (for comparison)
	print("\nReference:")
	print(sample["transcription"].lower())
	```

	---

	## Free memory

	```python
	import gc

	import psutil
	import torch

	# Collect unreachable Python objects first, so that any GPU tensors they
	# hold can be released (delete references such as `model` or `pipe` beforehand)
	gc.collect()

	# Free cached GPU memory; only needed if you use a GPU
	if torch.cuda.is_available():
	    torch.cuda.empty_cache()              # Clears the CUDA memory cache
	    torch.cuda.reset_peak_memory_stats()  # Resets peak memory statistics

	# Optionally, check memory status after clearing
	if torch.cuda.is_available():
	    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
	    print(f"GPU Memory Reserved: {torch.cuda.memory_reserved()} bytes")
	else:
	    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")
	```

	---

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- num_epochs: 20
	- mixed_precision_training: Native AMP
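
To make the `linear` scheduler with 500 warmup steps concrete, here is a minimal sketch of how the learning rate would evolve, assuming the common behaviour of this schedule: a linear ramp from 0 to the base rate during warmup, then a linear decay to 0 at the end of training. The total of 17,500 steps is an assumption inferred from the results table (12,250 steps at epoch 14, i.e. 875 steps per epoch over 20 epochs), and `linear_schedule_lr` is a hypothetical helper, not part of the training code.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=17_500):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: go linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# total_steps=17_500 is an assumption (875 steps per epoch x 20 epochs).
for step in (0, 250, 500, 9_000, 17_500):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

The peak rate of 0.0001 is reached exactly at step 500, after which it only decreases.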

	### Training results

	| Training Loss | Epoch | Step  | Validation Loss | WER    |
	|:-------------:|:-----:|:-----:|:---------------:|:------:|
	| 0.3793        | 14.0  | 12250 | 0.1517          | 0.1888 |
	| 0.3709        | 15.0  | 13125 | 0.1512          | 0.1882 |
	| 0.3702        | 16.0  | 14000 | 0.1499          | 0.1858 |
	| 0.367         | 17.0  | 14875 | 0.1492          | 0.1848 |
	| 0.3656        | 18.0  | 15750 | 0.1493          | 0.1842 |
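
The last column reports word error rate (WER) as a fraction, so 0.1842 corresponds to roughly 18 word errors per 100 reference words. The model card does not say which implementation computed it; the sketch below shows the standard definition (word-level edit distance divided by the number of reference words) with a hypothetical helper and placeholder sentences.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Placeholder sentences for illustration, not taken from the dataset.
reference = "one two three four"
hypothesis = "one too three"
print(word_error_rate(reference, hypothesis))
```

Here one word is substituted and one is missing, giving 2 errors over 4 reference words.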


	### Framework versions

	- Transformers 4.41.2
	- PyTorch 2.4.0+cu121
	- Datasets 3.2.0
	- Tokenizers 0.19.1

	---

	## Intended uses & limitations

	- Intended uses: This model is intended for speech-to-text tasks in Wolof. It can be used to transcribe audio recordings in Wolof into written text.
	- Limitations: This model performs best with clean audio and may struggle with noisy or low-quality recordings. It is designed specifically for the Wolof language and may not work well with other languages.



	### Author Information

	- Author: Bilal FAYE