PellelNitram/xournalpp-htr-word-detector

+---
+language: en
+license: mit
+tags:
+  - word-detection
+  - handwriting
+  - onnx
+  - xournalpp
+library_name: onnxruntime
+pipeline_tag: object-detection
+---
+# WordDetector — Word-Level Bounding Box Detection for Handwritten Text
+A word-detection model that locates individual handwritten words in document
+images. It produces axis-aligned bounding boxes — no transcription or labels.
+Part of the [Xournal++ HTR](https://github.com/PellelNitram/xournalpp_htr)
+project.
+## Model details
+| Property | Value |
+| --- | --- |
+| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
+| Input | Grayscale image, resized to 448×448 |
+| Output | 7 feature maps at 224×224 (segmentation + geometry) |
+| Format | ONNX (softmax baked in, opset 17) |
+| Parameters | ~11.2M |
+| Training data | [IAM Handwriting Database](https://huggingface.co/datasets/PellelNitram/xournalpp_htr_IAM_DB) |
+| Best validation F1 | 0.88 (learning rate 0.001, batch size 16, 200 epochs) |
+| License | MIT |
+## Usage
+```python
+from xournalpp_htr.inference_models import WordDetectorModel
+
+model = WordDetectorModel.from_pretrained()
+# `grayscale_image` is the page image to analyze, already loaded as grayscale.
+boxes = model.detect(grayscale_image)  # list[BoundingBox]
+```
+Each `BoundingBox` has `x_min`, `y_min`, `x_max`, `y_max` in the original
+image's pixel coordinates.
+Install with `pip install xournalpp-htr`. This pulls in `onnxruntime` and
+`huggingface-hub`; PyTorch is not needed.
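Once boxes are available, a natural next step is to crop each word region for later processing. The sketch below is illustrative only: it uses a stand-in `Box` dataclass with the four documented fields rather than the library's own `BoundingBox`, stores the image as a plain list of pixel rows, and assumes integer coordinates with exclusive `x_max` / `y_max`. The package may define these details differently.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Stand-in for the library's BoundingBox (same four fields)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def crop_word(image, box):
    """Return the pixels inside `box` from an image stored as a list of rows.

    Assumes x_max / y_max are exclusive; clamps the box to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x0, x1 = max(0, box.x_min), min(width, box.x_max)
    y0, y1 = max(0, box.y_min), min(height, box.y_max)
    return [row[x0:x1] for row in image[y0:y1]]


# Tiny 4x6 "image" whose pixel values encode their position (10 * y + x).
image = [[10 * y + x for x in range(6)] for y in range(4)]
word = crop_word(image, Box(x_min=1, y_min=1, x_max=4, y_max=3))
```

With a NumPy array instead of nested lists, the same crop is the slice `image[y0:y1, x0:x1]`.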
+## How it works
+The model outputs 7 maps per image:
+- **Segmentation** (3 channels): word / surrounding margin / background
+  (softmax classification)
+- **Geometry** (4 channels): per-pixel distance to the top, bottom, left, and
+  right edges of the enclosing word bounding box
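To make the geometry channels concrete: a pixel at map position (x, y) that predicts distances to the four edges implies one candidate box. The sketch below turns one such prediction into a box and rescales it from the 224×224 output grid to the original image size. The function names, and the assumption that the output grid maps linearly onto the full original image, are illustrative and not taken from the project code.

```python
MAP_SIZE = 224  # side length of the output maps


def box_from_pixel(x, y, top, bottom, left, right):
    """Candidate box (x_min, y_min, x_max, y_max) implied by one pixel.

    The y axis points down, so the top edge lies at y - top.
    """
    return (x - left, y - top, x + right, y + bottom)


def to_image_coords(box, image_width, image_height, map_size=MAP_SIZE):
    """Rescale a box from map coordinates to original-image pixels."""
    sx = image_width / map_size
    sy = image_height / map_size
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


map_box = box_from_pixel(x=100, y=50, top=5, bottom=7, left=20, right=12)
image_box = to_image_coords(map_box, image_width=896, image_height=448)
```

Because the horizontal and vertical scale factors are computed separately, the sketch also undoes the aspect-ratio change introduced by resizing a non-square image to a square input.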
+Post-processing decodes these maps into bounding boxes via connected-component
+analysis and DBSCAN clustering.
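As an illustration of the first of these steps only, the sketch below labels 4-connected regions in a small binary word mask and reports one bounding box per region. It is a generic flood fill, not the project's implementation; the connectivity choice, the function name, and the inclusive maximum coordinates are assumptions, and the DBSCAN step is not shown.

```python
from collections import deque


def component_boxes(mask):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected region.

    `mask` is a list of rows of 0/1 values; max coordinates are inclusive.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
boxes = component_boxes(mask)
```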
+## Training
+Trained on the IAM Handwriting Database with an 80/20 random split. The best
+model was selected by a grid search over learning rates (0.0005, 0.001, 0.002)
+and batch sizes (16, 32, 64, 128); each run used early stopping
+(patience=50).
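The grid described above contains 3 × 4 = 12 configurations. A minimal sketch of enumerating them follows; the dictionary keys are illustrative and not the project's actual config names.

```python
from itertools import product

learning_rates = (0.0005, 0.001, 0.002)
batch_sizes = (16, 32, 64, 128)

# One training run per (learning rate, batch size) pair.
grid = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in product(learning_rates, batch_sizes)
]
```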
+| Hyperparameter | Value |
+| --- | --- |
+| Optimizer | Adam |
+| Learning rate | 0.001 |
+| Batch size | 16 |
+| Max epochs | 200 |
+| Loss | Cross-entropy (segmentation) + IoU (geometry) |
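The exact form of the geometry loss is not given here. As a hedged illustration, the sketch below computes the IoU between a predicted and a target box when both are expressed as edge distances from the same pixel, and uses `-log(IoU)` as the loss, a form common in dense box regression. The project may use a different variant; the function names and the `eps` value are assumptions.

```python
import math


def iou_from_distances(pred, target):
    """IoU of two boxes given as (top, bottom, left, right) distances
    measured from the same pixel. Distances are assumed non-negative."""
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    # Both boxes contain the shared pixel, so the overlap along each axis
    # is the sum of the smaller distance on either side of it.
    inter_h = min(pt, tt) + min(pb, tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter = inter_h * inter_w
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target, eps=1e-6):
    """-log(IoU); eps keeps the log finite when the overlap is zero."""
    return -math.log(iou_from_distances(pred, target) + eps)


iou = iou_from_distances(pred=(4, 4, 10, 10), target=(5, 5, 10, 10))
```

In this example the predicted box is slightly shorter than the target, so the intersection equals the predicted area (160) and the union equals the target area (200).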
+Full training instructions:
+[README](https://github.com/PellelNitram/xournalpp_htr/blob/master/xournalpp_htr/training/word_detector/README.md).
+## Intended use
+This model is the detection stage in a handwriting recognition pipeline. It is
+designed to run on personal devices (laptops, edge) via ONNX Runtime — no GPU
+required for inference. A separate transcription model (not yet available)
+would read the detected word regions.
+## Limitations
+- Detection only — no text transcription.
+- Grayscale input required.
+- Fixed 448×448 resize may distort aspect ratio on non-square images.
+- No training-time data augmentation (planned improvement).
+- Validated on IAM-style handwriting; performance on other styles (e.g.
+  historical documents) may vary.
+## Citation
+The architecture is based on
+[WordDetectorNN](https://github.com/githubharald/WordDetectorNN) by
+[Harald Scheidl](https://github.com/githubharald).