This repository contains a LoRA adapter for the facebook/nllb-200-distilled-600M model, fine‑tuned on the TajikNLPWorld/TajPersParallelLexicalCorpus dataset for the translation task from Tajik (Cyrillic) to Persian (Arabic script).
The fine‑tuned model significantly outperforms the zero‑shot baseline (chrF 0.0 → 53.0, BERTScore 0.69 → 0.915).
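For readers unfamiliar with chrF: it is an F-score over character n-grams shared by the hypothesis and the reference, with recall weighted more heavily than precision. The sketch below is a simplified, pure-Python illustration on a 0 to 100 scale. It is not the evaluation code behind the numbers above (the tool used is not stated here), and the function names `char_ngrams` and `simple_chrf` are made up for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of `text`, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average n-gram precision/recall, then F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A score of 0.0 means the output shares no character n-grams with the reference, which is what happens when a system answers in the wrong script.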
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel
base_model_name = "facebook/nllb-200-distilled-600M"
adapter_path = "TajikNLPWorld/nllb-600m-tajik-persian-lora"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload() # optional
# Translate
tokenizer.src_lang = "tgk_Cyrl"  # NLLB (FLORES-200) code for Tajik in Cyrillic script
text = "ришк"
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # keep inputs on the model's device
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
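NLLB checkpoints are normally steered toward an output language by forcing the first generated token to be the target language code. The helper below is a minimal sketch of that pattern for batches of inputs. It assumes the standard NLLB codes `tgk_Cyrl` (Tajik) and `pes_Arab` (Western Persian); whether the adapter was trained with exactly these codes is an assumption, and `translate` is a hypothetical helper name, not part of this repository.

```python
def translate(texts, model, tokenizer,
              src_lang="tgk_Cyrl", tgt_lang="pes_Arab", max_length=64):
    """Translate a list of strings with an NLLB-style model and tokenizer.

    The language codes are assumptions (standard NLLB/FLORES-200 codes);
    adjust them if the adapter was trained with different ones.
    """
    tokenizer.src_lang = src_lang
    # Tokenize as a padded batch and move tensors to the model's device.
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(
        **inputs,
        # Force the decoder to start with the target language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage with the objects loaded above:
# print(translate(["ришк"], model, tokenizer))
```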
If you prefer a single standalone model, you can merge the adapter into the base model once, save the result, and load that checkpoint directly afterwards.
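One way to do the merge is sketched below. It assumes you pass the object returned by `PeftModel.from_pretrained(...)` before `merge_and_unload()` has been called on it. The helper name `merge_and_save` and the output directory in the usage comment are hypothetical.

```python
def merge_and_save(peft_model, tokenizer, output_dir):
    """Fold the LoRA weights into the base model and save a standalone copy.

    `peft_model` is expected to be a PEFT-wrapped model that has not been
    merged yet; `output_dir` is any writable directory.
    """
    merged = peft_model.merge_and_unload()  # plain model with LoRA deltas applied
    merged.save_pretrained(output_dir)      # model weights and config
    tokenizer.save_pretrained(output_dir)   # tokenizer files next to the weights
    return merged


# Usage (hypothetical output path):
# merged = merge_and_save(model, tokenizer, "nllb-600m-tajik-persian-merged")
# Later, without peft installed:
# AutoModelForSeq2SeqLM.from_pretrained("nllb-600m-tajik-persian-merged")
```

The merged checkpoint is as large as the full base model, so this trades disk space for a simpler loading path.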
adapter_config.json – LoRA configuration.
adapter_model.bin – LoRA weights.
results/ – Folder containing evaluation metrics, plots, and predictions.

If you use this model, please cite our work (to be added).
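This adapter is released under the Apache 2.0 license. Note that the base model, facebook/nllb-200-distilled-600M, is distributed under CC-BY-NC 4.0, so its terms also apply when the adapter is used together with the base weights.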
Developed by: Arabov Mullosharaf / TajikNLP
Contact: cool.araby@gmail.com