---
language: en
license: mit
tags:
- word-detection
- handwriting
- onnx
- xournalpp
library_name: onnxruntime
pipeline_tag: object-detection
---

# WordDetector: Word-Level Bounding Box Detection for Handwritten Text

A word-detection model that locates individual handwritten words in document images. It produces axis-aligned bounding boxes only, with no transcription or labels.

Part of the [Xournal++ HTR](https://github.com/PellelNitram/xournalpp_htr) project.

## Model details

| Property | Value |
| --- | --- |
| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
| Input | Grayscale image, resized to 448×448 |
| Output | 7 feature maps at 224×224 (segmentation + geometry) |
| Format | ONNX (softmax baked in, opset 17) |
| Parameters | ~11.2M |
| Training data | [IAM Handwriting Database](https://huggingface.co/datasets/PellelNitram/xournalpp_htr_IAM_DB) |
| Best val F1 | 0.88 (lr=0.001, bs=16, 200 epochs) |
| License | MIT |

## Usage

```python
from xournalpp_htr.inference_models import WordDetectorModel

model = WordDetectorModel.from_pretrained()
boxes = model.detect(grayscale_image)  # list[BoundingBox]
```

Each `BoundingBox` has `x_min`, `y_min`, `x_max`, `y_max` in the original image's pixel coordinates.

Requires `pip install xournalpp-htr` (pulls `onnxruntime` and `huggingface-hub`, no PyTorch needed).

## How it works

The model outputs 7 maps per image:

- **Segmentation** (3 channels): word / surrounding margin / background (softmax classification)
- **Geometry** (4 channels): per-pixel distance to the top, bottom, left, and right edges of the enclosing word bounding box

Post-processing decodes these maps into bounding boxes via connected-component analysis and DBSCAN clustering.

## Training

Trained on the IAM Handwriting Database with an 80/20 random split. The best model was selected via a hyperparameter grid search over learning rates (0.0005, 0.001, 0.002) and batch sizes (16, 32, 64, 128) with early stopping (patience=50).
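
The grid above covers 3 × 4 = 12 configurations. A minimal sketch of such a search loop follows, assuming validation F1 is the metric monitored for early stopping and model selection. `train` is a hypothetical callable standing in for the project's training routine; it is not part of the actual training code.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# Search space as listed in the paragraph above.
LEARNING_RATES = (0.0005, 0.001, 0.002)
BATCH_SIZES = (16, 32, 64, 128)


def run_with_early_stopping(val_f1_per_epoch: Iterable[float], patience: int = 50) -> float:
    """Return the best validation F1 seen before patience runs out."""
    best = float("-inf")
    epochs_without_gain = 0
    for f1 in val_f1_per_epoch:
        if f1 > best:
            best = f1
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best


def grid_search(
    train: Callable[[float, int], Iterable[float]],  # hypothetical: yields val F1 per epoch
    patience: int = 50,
) -> Tuple[Tuple[float, int], Dict[Tuple[float, int], float]]:
    """Evaluate every (learning rate, batch size) pair and return the best one."""
    results = {
        (lr, bs): run_with_early_stopping(train(lr, bs), patience)
        for lr, bs in product(LEARNING_RATES, BATCH_SIZES)
    }
    best_config = max(results, key=results.get)
    return best_config, results
```

Passing a generator as `train` lets a run stop as soon as patience is exhausted, so configurations that plateau early do not consume the full epoch budget.
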

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Batch size | 16 |
| Max epochs | 200 |
| Loss | Cross-entropy (segmentation) + IoU (geometry) |

Full training instructions: [README](https://github.com/PellelNitram/xournalpp_htr/blob/master/xournalpp_htr/training/word_detector/README.md).

## Intended use

This model is the detection stage in a handwriting recognition pipeline. It is designed to run on personal devices (laptops, edge devices) via ONNX Runtime, with no GPU required for inference. A separate transcription model (not yet available) would read the detected word regions.

## Limitations

- Detection only; no text transcription.
- Grayscale input required.
- Fixed 448×448 resize may distort aspect ratio on non-square images.
- No training-time data augmentation (planned improvement).
- Validated on IAM-style handwriting; performance on other styles (e.g. historical documents) may vary.

## Citation

The architecture is based on [WordDetectorNN](https://github.com/githubharald/WordDetectorNN) by [Harald Scheidl](https://github.com/githubharald).
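
To make the geometry channels from the How it works section concrete, the following minimal sketch turns each word pixel into a candidate box from its four edge distances. The 0.5 threshold, the nested-list `[y][x]` layout, and the use of IoU as the similarity measure for a later clustering step are assumptions for illustration. The project's actual post-processing (connected-component analysis and DBSCAN) is not reproduced here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Map = Sequence[Sequence[float]]          # one output channel, indexed [y][x]


def decode_pixel_boxes(
    word_prob: Map, top: Map, bottom: Map, left: Map, right: Map,
    threshold: float = 0.5,  # assumed cut-off for the "word" class
) -> List[Box]:
    """Build one candidate box per pixel classified as 'word'."""
    boxes = []
    for y, row in enumerate(word_prob):
        for x, p in enumerate(row):
            if p > threshold:
                # Each geometry channel holds the distance from (x, y) to one box edge.
                boxes.append((x - left[y][x], y - top[y][x],
                              x + right[y][x], y + bottom[y][x]))
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Pixels inside the same word yield nearly identical candidate boxes, so a clustering step can group them, for example with `1 - iou(a, b)` as the distance, and reduce each group to a single box.
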