---
license: other
datasets:
- Remeinium/CleanSinhalaTextCorpus
language:
- si
pipeline_tag: feature-extraction
library_name: fasttext
tags:
- sinhala
- fasttext
- vectors
- embedding
- nlp
- low-resource-languages
- remeinium
---

# UgannA Siyabasa V2: FastText Sinhala Embedding Model 🇱🇰

> UgannA_SiyabasaV2 (උගන්නැ සියබස) is the first public FastText embedding model released by Remeinium Corp.

The name comes from Kumaratunga Munidasa’s timeless quote:

“උගන්නැ සියබස – මත් වන්නැ එහි රසයෙන්”
Learn Sinhala – be intoxicated with its beauty.

Just as Munidasa envisioned nurturing the Sinhala language, this model is a step toward teaching it to machines.

# 📌 Key Features

- Type: FastText
- Vector size: 300 dimensions
- File size: ~3.94GB
- Training data: 17GB of processed Sinhala text

# 🔧 Usage

```python
import fasttext

# Load the model from a local copy of the .bin file
model = fasttext.load_model("Remeinium/UgannA_Siyabasa/UgannA_Siyabasa.bin")

# Get the 300-dimensional vector for a word
vector = model.get_word_vector("අම්මා")

# Get the 10 nearest neighbors as (similarity, word) pairs
neighbors = model.get_nearest_neighbors("අම්මා", k=10)
print(neighbors)
```

# Use API

- Test Live: [Embedding Playground](https://huggingface.co/spaces/Remeinium/Embedding_Siyabasa)
- API Docs: [Go to API Console](https://esdocs.ai.remeinium.com)

# 📂 Training Data

- Processed and cleaned training corpus: ~17GB
- Preprocessing: tokenization, normalization, deduplication

# 🗜️ License

This model is released under the **[Remeinium Open Model License (ROML)](https://huggingface.co/Remeinium/UgannA_SiyabasaV2/blob/main/LICENCE)**. It permits research and commercial use with attribution. See the LICENSE file for full terms.

# ⚠️ Limitations

- May reflect cultural and linguistic biases present in the training sources.
- Optimized for Sinhala; not multilingual.

# 🤝 Collaboration

You are welcome to:

- Use this model for research and your own projects
- Share improvements, benchmarks, or downstream applications
- Contact us: 📧 support@remeinium.com
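
# 📐 Comparing Word Vectors

The usage snippet retrieves vectors and nearest neighbors, but a common next step is scoring how similar two specific words are. The following is a minimal sketch, not part of the official model tooling: `cosine_similarity` and `word_similarity` are hypothetical helper names introduced here, and the only thing assumed about the model is the `get_word_vector` method already shown in the usage snippet.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector
        # is treated as having no similarity to anything.
        return 0.0
    return float(dot / (norm_a * norm_b))


def word_similarity(model, word_a: str, word_b: str) -> float:
    """Compare two words using any object that exposes get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )
```

With a model loaded through `fasttext.load_model(...)`, a call such as `word_similarity(model, "අම්මා", "තාත්තා")` would return a score between -1 and 1, where higher values indicate more closely related words. The helper is written in plain Python to stay dependency-free; for large batches, a vectorized NumPy version would likely be faster.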