--- library_name: transformers tags: - english-dhivehi-latin license: mit datasets: - alakxender/dhivehi-english-parallel - alakxender/dhivehi-translit-mixed - alakxender/dhivehi-english-pairs-with-metadata language: - dv base_model: - google/flan-t5-base --- # FLAN-T5-base Fine-tuned for Dhivehi-English-Latin Translation This model is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) for multilingual translation between Dhivehi, English and Latin script. ## Model description The model was trained on a combination of: - English-Dhivehi parallel corpus - Dhivehi-Latin transliteration pairs It supports the following translation directions: - English to Dhivehi (en2dv) - Dhivehi to English (dv2en) - Dhivehi to Latin script (dv2latin) - Latin script to Dhivehi (latin2dv) ## Training and Evaluation The model was trained for 3 epochs with the following parameters: - Learning rate: 3e-4 - Batch size: 2 - Weight decay: 0.01 - Max sequence length: 128 tokens Final training metrics: - Training loss: 1.305 - Training runtime: 535,653 seconds - Training samples per second: 3.733 Final evaluation metrics: - Evaluation loss: 1.167 - ROUGE-1: 0.415 - ROUGE-2: 0.288 - ROUGE-L: 0.402 - ROUGE-Lsum: 0.405 - Evaluation runtime: 131,307 seconds - Evaluation samples per second: 5.077 ## Usage To use the model, you can load it using the `from_pretrained` method: ```python from transformers import T5ForConditionalGeneration, AutoTokenizer model = T5ForConditionalGeneration.from_pretrained("alakxender/flan-t5-base-dhivehi-en-latin") tokenizer = AutoTokenizer.from_pretrained("alakxender/flan-t5-base-dhivehi-en-latin") supported_languages = ["en2dv", "dv2en", "dv2latin", "latin2dv"] def translate(source_text, target_language): prompt = f"{target_language.strip()} {source_text.strip()}" inputs = tokenizer(prompt, return_tensors="pt", max_length=128, truncation=True) output_ids = model.generate( **inputs, max_length=128, min_length=10, num_beams=4, early_stopping=True, no_repeat_ngram_size=3, repetition_penalty=1.2, do_sample=False, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id ) result = tokenizer.decode(output_ids[0], skip_special_tokens=True) return result # Example usage source_text = "Concerns over prepayment of GST raised in parliament" target_language = "en2dv" translated_text = translate(source_text, target_language) print(translated_text) # Output: ރައްޔިތުންގެ މަޖިލީހުގައި ޖީއެސްޓީގެ އަގު ބޮޑުވުމާ ގުޅިގެން ކަންބޮޑުވުން ފާޅުކޮށްފި source_text = "ދުނިޔޭގެ އެކި ކަންކޮޅުތަކުން 1.4 މިލިއަން މީހުން މައްކާއަށް ޖަމާވެފައި" target_language = "dv2en" translated_text = translate(source_text, target_language) print(translated_text) # Output: 1.4 million people gathered in Mecca from different parts of the world source_text = "ވައިބާރުވުމުން ކުޅުދުއްފުށީ އެއާޕޯޓަށް ނުޖެއްސިގެން މޯލްޑިވިއަންގެ ބޯޓެއް އެނބުރި މާލެއަށް" target_language = "dv2latin" translated_text = translate(source_text, target_language) print(translated_text) # Output: Vaibaruvumun kulhudhuhfushee eaapoatah nujehsigen moaldiviange boateh enburi maaleah source_text = "Paakisthaanuge skoolu bahakah dhin hamalaaehgai thin kuhjakaai bodu dhe meehaku maruvehje" target_language = "latin2dv" translated_text = translate(source_text, target_language) print(translated_text) # Output: ޕާކިސްތާނުގެ ސްކޫލު ބަހަކަށް ދިން ހަމަލާއެއްގައި ތިން ކުއްޖަކާއި ބޮޑު ދެ މީހަކު މަރުވެއްޖެ ``` ****Note: The model was fine-tuned mostly on english-dhivehi local news articles. It may not perform well on other domains.***