---
language: dv
license: apache-2.0
tags:
- ocr
- trocr
- dhivehi
- maldives
- thaana
pipeline_tag: image-to-text
base_model: microsoft/trocr-base-handwritten
datasets:
- alakxender/dhivehi-image-text
- alakxender/dhivehi-vrd-batch-1-img-questions
---

# Dhivehi TrOCR Base V6

A fine-tuned [TrOCR](https://huggingface.co/microsoft/trocr-base-handwritten) model for recognizing Dhivehi (Maldivian) text written in Thaana script.

## Model Details

- **Base model:** microsoft/trocr-base-handwritten
- **Parameters:** ~334M
- **Training data:** ~695K samples (315K dhivehi-image-text + 380K dhivehi-vrd)
- **Best CER:** 0.9% (checkpoint-20000)
- **Tokenizer:** character-level WordLevel tokenizer with an EOS token

## Usage

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, PreTrainedTokenizerFast
from PIL import Image
import torch

repo_id = "Serialtechlab/dhivehi-trocr-base-handwritten"

# The processor prepares the image; the character-level tokenizer is loaded separately.
processor = TrOCRProcessor.from_pretrained(repo_id)
model = VisionEncoderDecoderModel.from_pretrained(repo_id)
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id)

# Input should be an image of a single text line.
image = Image.open("dhivehi_text.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_length=128, num_beams=4)

# Convert ids to character tokens, drop special tokens, and join into one string.
tokens = tokenizer.convert_ids_to_tokens(generated_ids[0])
special = {tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token}
text = "".join(t for t in tokens if t not in special)
print(text)
```

## Training

Fine-tuned from `microsoft/trocr-base-handwritten` on Google Colab (A100) for 6 epochs with:

- Learning rate: 4e-5
- Batch size: 16
- EOS token appended to all labels
- PAD tokens in the labels masked with -100 so that they are ignored by the loss
- Character-level WordLevel tokenizer

## Limitations

- Optimized for single text line images (use a text detector such as Surya to split full pages into lines first)
- May truncate very long lines (generation is capped at `max_length=128` tokens, roughly 128 characters with the character-level tokenizer)
- Best results on printed Dhivehi text; handwritten accuracy varies by style
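
The Training section lists two label-preparation details: an EOS token appended to every label, and PAD positions masked with -100. The training script itself is not part of this card, so the following is only a minimal sketch of that idea. The vocabulary, the token ids, and the helper name `encode_label` are hypothetical and chosen for illustration; the real tokenizer's ids will differ.

```python
# Hypothetical toy vocabulary (NOT the model's real vocabulary or ids).
PAD_ID, BOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
VOCAB = {
    "<pad>": PAD_ID, "<s>": BOS_ID, "</s>": EOS_ID, "<unk>": UNK_ID,
    "\u078b": 4,  # Thaana letter dhaalu
    "\u07a8": 5,  # Thaana vowel sign ibifili
    "\u0788": 6,  # Thaana letter vaavu
    "\u07ac": 7,  # Thaana vowel sign ebefili
    "\u0780": 8,  # Thaana letter haa
}
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def encode_label(text, max_length):
    """Encode text per character, append EOS, pad, then mask the padding."""
    # One id per character; unknown characters map to UNK.
    ids = [VOCAB.get(ch, UNK_ID) for ch in text]
    # Leave room for EOS, then append it so the decoder learns where to stop.
    ids = ids[: max_length - 1] + [EOS_ID]
    # Pad to a fixed length.
    padded = ids + [PAD_ID] * (max_length - len(ids))
    # Replace padding positions with -100 so they do not contribute to the loss.
    return [tok if i < len(ids) else IGNORE_INDEX for i, tok in enumerate(padded)]


word = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"  # the word "Dhivehi" in Thaana
labels = encode_label(word, max_length=10)
print(labels)
```

Masking by position rather than by value keeps the sketch correct even if a real token happened to share the PAD id.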
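
The Model Details section reports a best CER of 0.9%. The card does not say how that figure was computed. A common definition of character error rate is the total edit distance between predictions and references divided by the total number of reference characters, and the sketch below assumes that definition. The function names `edit_distance` and `cer` are hypothetical helpers, not part of this model's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One wrong character out of five reference characters.
print(cer(["abc", "de"], ["abc", "dx"]))
```

Python iterates strings by Unicode code point, so a Thaana consonant and its vowel sign count as two characters in this calculation.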