---
language: bm
tags:
- bambara
- fasttext
- embeddings
- word-vectors
- african-nlp
- low-resource
license: apache-2.0
datasets:
- bambara-corpus
metrics:
- cosine_similarity
pipeline_tag: feature-extraction
---

# Bambara FastText Embeddings

## Model Description

This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable various NLP tasks for this low-resource African language.

**Model Type:** FastText Word Embeddings

**Language:** Bambara (bm)

**License:** Apache 2.0

## Model Details

### Model Architecture

- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)
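
To illustrate the subword mechanism listed above, here is a minimal sketch of fastText-style character n-gram extraction. The function name `char_ngrams` is hypothetical, and the n-gram range of 3 to 6 is fastText's default for skip-gram training, not a setting confirmed for this model.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, fastText style."""
    # fastText wraps each word in boundary markers before slicing it,
    # so prefixes and suffixes get their own distinct n-grams.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Trigrams only, to keep the output short.
print(char_ngrams("muso", min_n=3, max_n=3))  # ['<mu', 'mus', 'uso', 'so>']
```

The vector for an out-of-vocabulary word is then built from the vectors of its n-grams, which is why inflected or misspelled forms can still receive a usable embedding.
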
### Training Data

The model was trained on Bambara text corpora and builds on [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.

### Intended Use

This model is designed for:

- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Cultural preservation** and digital humanities projects
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara

## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```

## Usage

### Load the Model

```python
import tempfile

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"
cache_dir = tempfile.gettempdir()

# Download both files into the same cache so they end up side by side:
# gensim looks for the .npy n-gram array next to bam.bin when loading.
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=cache_dir)
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=cache_dir)

# Load the saved gensim vectors
model = KeyedVectors.load(model_path)
print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```
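
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out with the standard library only; the helper name `cosine` is hypothetical, and the toy vectors are made up rather than taken from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; 0.0 is a convention chosen here
    return dot / (norm_u * norm_v)

# Toy vectors, not real embeddings.
print(f"{cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]):.4f}")  # 0.5000
```

Because the function only iterates over its inputs, it should also accept the NumPy vectors returned by `model[...]`.
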
### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    # Average the vectors of in-vocabulary words; return zeros if none are known
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_similar_texts(query, texts, model, top_k=5):
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        # Skip texts with no in-vocabulary words (all-zero vector)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]
results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```
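
The loop above recomputes every text vector for each query. When the collection is fixed, one option is to embed the texts once, L2-normalize the rows, and score a query with a single matrix-vector product. The sketch below assumes you already have one vector per text (for example from `text_to_vector`); the names `build_index` and `top_k_matches` are hypothetical, and the toy 2-D vectors stand in for real 300-dimensional embeddings.

```python
import numpy as np

def build_index(text_vectors):
    """L2-normalize each row so a dot product equals cosine similarity."""
    matrix = np.asarray(text_vectors, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    return matrix / norms

def top_k_matches(query_vec, index, top_k=5):
    """Return (row, score) pairs for the rows most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    norm = np.linalg.norm(query)
    if norm == 0:
        return []  # a zero query vector has no defined cosine similarity
    scores = index @ (query / norm)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors, not real embeddings.
index = build_index([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
for row, score in top_k_matches([2.0, 1.0], index, top_k=2):
    print(f"  [{score:.4f}] row {row}")
```

The returned row numbers index back into the original list of texts, so the matching strings can be looked up after ranking.
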
### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```

## Limitations

- Vocabulary is limited to 9,973 words, though subword information helps with out-of-vocabulary (OOV) words
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data

## References

```bibtex
@misc{bambara-fasttext,
  author = {MALIBA-AI},
  title = {Bambara FastText Embeddings},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year = {2025},
  school = {Saarland University},
  note = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.

## Contributing

This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**

---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*