---
language:
- de
license: apache-2.0
library_name: transformers
base_model: lightonai/LightOnOCR-2-1B-base
tags:
- ocr
- vision-language
- lightonocr
- document-understanding
- german
- shorthand
- manuscript
- medieval
datasets:
- medieval-data/german-shorthand-line
pipeline_tag: image-text-to-text
---

# LightOnOCR-2-1B for German (Line-Level)


This model is a **fine-tuned version of [lightonai/LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base)** trained specifically for **line-level OCR** of German shorthand manuscripts.

## Model Description

- **Base Model:** [lightonai/LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base)
- **Training Data:** [medieval-data/german-shorthand-line](https://huggingface.co/datasets/medieval-data/german-shorthand-line)
- **Task:** Line-level text transcription from document images
- **Language:** German (de)
- **Architecture:** Vision-Language Model (1B parameters)

This is a **line-level model**: it expects cropped line images as input, not full pages. Each image should contain a single line of text.

## Evaluation Results

Evaluated on 50 samples from the test set:

| Metric | Base Model | **Finetuned** | Improvement |
|--------|------------|---------------|-------------|
| CER (%) | 381.26 | **21.89** | +359.37 |
| WER (%) | 494.99 | **37.41** | +457.58 |
| Perfect Matches | 0 | **0** | +0 |

*Lower CER/WER is better. Higher perfect matches is better. For CER and WER, the Improvement column is the absolute reduction in percentage points.*

### Example Outputs

| # | Ground Truth | Base Model | **Finetuned** |
|---|--------------|------------|---------------|
| 1 | (Haupt der seligen Irmeng. gefunden. Im ... | 12/12/1998 10:00 AM 10:00 AM 10:00 AM 10... | (Haupt der seitdem Jänner 12 20 bei Daue... |
| 2 | Schw. Reinh.: Ist vom Lagerdienst freige... | Schw. Reinh. : 2d 9.20 16 09 J. 6 | Schw. Reinh.: Ist vom Lagerdienst frei g... |
| 3 | Klage daß im Naz.heim den Kranken die Ko... | `$$ \begin{aligned} & \text { 22 e 2 haz....` | Klage daß im Naz.heim den Kranken die Ko... |
| 4 | Irene: Stimmung sehr verschieden. Kommen... | | Irene: Stimmung sehr verschiedenes. Münd... |
| 5 | Zwei Schwestern Calabrien: M. Cristina u... | 226 *Kolabrie: M. Cisneros, Urode* | Zwei Schwestern Katalrien: M. Cristina u... |

*✓ = exact match*

## Usage

### Installation

```bash
# Requires transformers from source
pip install git+https://github.com/huggingface/transformers
pip install pillow torch
```

### Python Usage

```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from PIL import Image

# Load model and processor
model_id = "wjbmattingly/LightOnOCR-2-1B-german-shorthand-line"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = LightOnOcrProcessor.from_pretrained(model_id)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).to(device)

# Load your line image
image = Image.open("your_image.jpg").convert("RGB")

# Prepare input
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Generate transcription
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (skip the prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0, input_length:]
transcription = processor.decode(generated_ids, skip_special_tokens=True)
print(transcription)
```

### Batch Inference

```python
from datasets import load_dataset

# Reuses `processor`, `model`, `device`, and `dtype` from the Python Usage example above

# Load dataset
dataset = load_dataset("medieval-data/german-shorthand-line", split="train[:10]")

# Process batch: one single-image list per sample
images = [[img.convert("RGB")] for img in dataset["image"]]
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = [text] * len(images)

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 700},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens before decoding
predictions = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

for pred, gt in zip(predictions, dataset["text"]):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {gt}")
    print()
```

## Training Details

- **Base Model:** [lightonai/LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base)
- **Training Method:** Fine-tuning with frozen language model backbone
- **Optimizer:** AdamW (fused)
- **Learning Rate:** 6e-5 with linear decay
- **Precision:** bfloat16

## Limitations

- This model is trained on **line-level images**. For full-page transcription, you must first segment the page into individual lines.
- Performance may vary on document styles not represented in the training data.

## Citation

If you use this model, please cite:

```bibtex
@misc{lightonocr2_finetuned_2026,
  title = {LightOnOCR Fine-tuned for German},
  author = {William Mattingly},
  year = {2026},
  howpublished = {\url{https://huggingface.co/wjbmattingly/LightOnOCR-2-1B-german-shorthand-line}}
}
```

And the original LightOnOCR paper:

```bibtex
@misc{lightonocr2_2026,
  title = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}
```

## Acknowledgments

- [LightOn AI](https://www.lighton.ai/) for the excellent LightOnOCR base model
- The creators of the [medieval-data/german-shorthand-line](https://huggingface.co/datasets/medieval-data/german-shorthand-line) dataset
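
## Appendix: Computing CER and WER

The evaluation script behind the reported CER and WER figures is not part of this card. The following is a minimal sketch that assumes the standard definitions: Levenshtein edit distance divided by the reference length, over characters for CER and over whitespace-separated words for WER. The function names and the sample strings are illustrative, not taken from the actual evaluation code. Under these definitions a score can exceed 100% when the prediction is much longer than the reference, which is consistent with the base model's scores above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Illustrative strings, not drawn from the dataset
    reference = "das ist gut"
    hypothesis = "das ist gutt"
    print(f"CER: {cer(reference, hypothesis):.2f}%")  # one extra character out of 11
    print(f"WER: {wer(reference, hypothesis):.2f}%")  # one wrong word out of 3

    # A repetitive, overlong prediction pushes the score above 100%
    print(f"CER: {cer('gut', 'gut gut gut gut'):.2f}%")
```

To score a batch, these functions can be applied to each prediction and ground-truth pair. Whether the reported numbers average per-line scores or pool edits across all lines is not stated in this card.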