---
license: apache-2.0
language:
- dv
tags:
- ocr
- dhivehi
- thaana
- paligemma
- vision-language
- text-recognition
base_model: google/paligemma2-3b-pt-224
datasets:
- alakxender/dhivehi-vrd-images
metrics:
- accuracy
library_name: transformers
---

# paligemma2-dhivehi-ocr-full

## Model Description

This is a fine-tuned PaliGemma 2 model for Dhivehi (Thaana script) Optical Character Recognition (OCR). The LoRA adapter has been merged into the base weights, producing a standalone model that is easy to deploy.

**Original adapter:** alakxender/paligemma2-qlora-dhivehi-ocr-224-sl-md-16k

**Base model:** google/paligemma2-3b-pt-224

**Merged on:** 2025-06-29 09:02:20

## Capabilities

- Extract Dhivehi/Thaana text from images
- Handle both single-line and multi-line text
- Optimized for printed Dhivehi text recognition
- Works with a range of image formats and qualities

## Usage

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Load the merged model (no separate base model or adapter loading required)
model_id = "Serialtechlab/paligemma2-dhivehi-ocr-full"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load your image; convert to RGB so grayscale, RGBA and palette images are handled consistently
image = Image.open("your_image.png").convert("RGB")

# Prepare inputs
prompt = "What text is written in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Move inputs to the model's device; only floating-point tensors (pixel_values) are cast to bfloat16
inputs = inputs.to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate (greedy decoding)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=False,
    )

# Decode only the newly generated tokens, skipping the echoed prompt
dhivehi_text = processor.decode(outputs[0][input_len:], skip_special_tokens=True).strip()
print(f"Extracted text: {dhivehi_text}")
```

## Model Details

- **Architecture:** PaliGemma 2 (vision-language model)
- **Fine-tuning:** LoRA (Low-Rank Adaptation)
- **Training data:** Dhivehi text images (`alakxender/dhivehi-vrd-images`)
- **Language:** Dhivehi (Thaana script)
- **Model size:** ~5.9 GB (merged weights)

## Performance

This model provides accurate Dhivehi text extraction from images, with good performance on:

- Printed text
- Various font sizes
- Different image qualities
- Single-line and multi-line text layouts

## Limitations

- Optimized for printed text; accuracy on handwritten text may be lower
- Performance depends on image quality and text clarity
- Best results come from clear, high-contrast images

## Training Details

- **Base model:** google/paligemma2-3b-pt-224
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Target modules:** Vision and language model layers
- **Rank:** 16
- **Alpha:** 32

## Citation

If you use this model, please cite:

```bibtex
@misc{dhivehi-ocr-paligemma,
  title={Dhivehi OCR with PaliGemma},
  author={Serialtechlab},
  year={2024},
  howpublished={\url{https://huggingface.co/Serialtechlab/paligemma2-dhivehi-ocr-full}}
}
```

## License

This model is released under the Apache 2.0 license. Note that the base model, google/paligemma2-3b-pt-224, is distributed under Google's Gemma license terms.
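The Training Details section lists a LoRA rank of 16 and an alpha of 32, and the model description says the adapter was merged into the base weights. The toy NumPy sketch below illustrates what such a merge means for a single linear layer. It assumes the standard LoRA formulation that PEFT uses by default, where the low-rank update is scaled by alpha / rank (here 32 / 16 = 2). The matrix sizes and random values are illustrative only and are not taken from the actual model.

```python
import numpy as np

# Toy layer sizes (illustrative only; real PaliGemma 2 layers are much larger).
d_out, d_in = 64, 48
r, alpha = 16, 32        # LoRA rank and alpha from the Training Details section
scaling = alpha / r      # standard LoRA scaling alpha / r = 2.0 (assumed PEFT default)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # LoRA down-projection (r x d_in)
B = rng.standard_normal((d_out, r))     # LoRA up-projection (d_out x r); random here, zero-initialised before real training


def adapter_forward(x: np.ndarray) -> np.ndarray:
    """Base layer plus a separate low-rank update, as when the adapter is loaded on top of the base model."""
    return W @ x + scaling * (B @ (A @ x))


# Merging folds the scaled low-rank update into one dense matrix of the original shape.
W_merged = W + scaling * (B @ A)


def merged_forward(x: np.ndarray) -> np.ndarray:
    """A single matrix multiply with the merged weight; no adapter is needed at inference time."""
    return W_merged @ x


x = rng.standard_normal(d_in)
print(np.allclose(adapter_forward(x), merged_forward(x)))  # True
print(W_merged.shape == W.shape)                           # True
```

Because the merged weight has the same shape as the original, merging adds no parameters and removes the extra adapter computation at inference time. This is why the merged checkpoint loads directly with `PaliGemmaForConditionalGeneration.from_pretrained`, without PEFT.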
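The Performance section describes accuracy qualitatively, and the metadata lists `accuracy` as the metric. To measure the model on your own labelled images, one common OCR metric is character error rate (CER). This is an assumption on our part and not necessarily the evaluation the authors used. CER is the Levenshtein edit distance between the reference text and the predicted text, divided by the reference length. The following is a minimal pure-Python sketch. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum single-character insertions, deletions
    and substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from an empty ref prefix
    for i, r_ch in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h_ch in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                   # delete r_ch
                curr[j - 1] + 1,               # insert h_ch
                prev[j - 1] + (r_ch != h_ch),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edit distance divided by total reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))             # 3
    print(character_error_rate(["kitten"], ["sitting"]))  # 0.5
```

Python iterates over strings by Unicode code point. Thaana vowel signs (fili) are therefore counted as separate characters, so a missing or wrong fili counts as one error.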