---
language: en
license: mit
tags:
  - word-detection
  - handwriting
  - onnx
  - xournalpp
library_name: onnxruntime
pipeline_tag: object-detection
---

# WordDetector — Word-Level Bounding Box Detection for Handwritten Text

WordDetector locates individual handwritten words in document images. It
outputs axis-aligned bounding boxes only; it does not transcribe text or assign
labels. It is part of the
[Xournal++ HTR](https://github.com/PellelNitram/xournalpp_htr) project.

## Model details

| Property | Value |
| --- | --- |
| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
| Input | Grayscale image, resized to 448×448 |
| Output | 7 feature maps at 224×224 (segmentation + geometry) |
| Format | ONNX (softmax baked in, opset 17) |
| Parameters | ~11.2M |
| Training data | [IAM Handwriting Database](https://huggingface.co/datasets/PellelNitram/xournalpp_htr_IAM_DB) |
| Best val F1 | 0.88 (lr=0.001, bs=16, 200 epochs) |
| License | MIT |

## Usage

```python
from xournalpp_htr.inference_models import WordDetectorModel

model = WordDetectorModel.from_pretrained()
boxes = model.detect(grayscale_image)  # list[BoundingBox]
```

Each `BoundingBox` has `x_min`, `y_min`, `x_max`, `y_max` in the original
image's pixel coordinates.

Install with `pip install xournalpp-htr`. This pulls in `onnxruntime` and
`huggingface-hub`; PyTorch is not needed.
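
The returned boxes can be used to cut word regions out of the page, for example
to feed a later transcription step. The sketch below is illustrative only:
`Box` is a hypothetical stand-in that exposes the four documented fields, and
`crop_words` is not part of the package. Treating `x_max`/`y_max` as exclusive
bounds is also an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical stand-in with the four documented coordinate fields."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def crop_words(image: np.ndarray, boxes) -> list:
    """Return one sub-image per box, clipped to the image bounds."""
    height, width = image.shape[:2]
    crops = []
    for box in boxes:
        x0 = max(0, int(np.floor(box.x_min)))
        y0 = max(0, int(np.floor(box.y_min)))
        x1 = min(width, int(np.ceil(box.x_max)))
        y1 = min(height, int(np.ceil(box.y_max)))
        if x1 > x0 and y1 > y0:  # skip boxes that fall outside the image
            crops.append(image[y0:y1, x0:x1])
    return crops


# A blank 100x200 (height x width) page and two boxes; the second one
# extends past the page and is clipped.
page = np.zeros((100, 200), dtype=np.uint8)
crops = crop_words(page, [Box(10, 20, 60, 40), Box(190, 90, 250, 130)])
print([c.shape for c in crops])  # [(20, 50), (10, 10)]
```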

## How it works

The model outputs 7 maps per image:

- **Segmentation** (3 channels): word / surrounding margin / background
  (softmax classification)
- **Geometry** (4 channels): per-pixel distance to the top, bottom, left, and
  right edges of the enclosing word bounding box

Post-processing decodes these maps into bounding boxes via connected-component
analysis and DBSCAN clustering.
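
The per-pixel part of this decoding can be sketched as follows. This is a
minimal illustration, not the project's post-processing code: the channel order
(word class first; distances ordered top, bottom, left, right) and the function
name `decode_pixel_boxes` are assumptions. The clustering step is replaced by a
plain median, which only works here because the synthetic example contains a
single word.

```python
import numpy as np


def decode_pixel_boxes(seg, geo, scale_x=1.0, scale_y=1.0):
    """Turn per-pixel predictions into one candidate box per word pixel.

    seg: (3, H, W) class scores; channel 0 is assumed to be "word".
    geo: (4, H, W) distances; assumed order is top, bottom, left, right.
    Returns an (N, 4) array of [x_min, y_min, x_max, y_max].
    """
    word_mask = seg.argmax(axis=0) == 0
    ys, xs = np.nonzero(word_mask)
    top, bottom, left, right = (geo[i][ys, xs] for i in range(4))
    boxes = np.stack([xs - left, ys - top, xs + right, ys + bottom], axis=1)
    # Scale from map coordinates to the coordinates of the target image.
    return boxes * np.array([scale_x, scale_y, scale_x, scale_y])


# Synthetic 8x8 maps containing one word spanning x in [2, 6], y in [3, 5].
H = W = 8
grid_y, grid_x = np.mgrid[0:H, 0:W]
inside = (grid_x >= 2) & (grid_x <= 6) & (grid_y >= 3) & (grid_y <= 5)

seg = np.zeros((3, H, W))
seg[0][inside] = 1.0   # word
seg[2][~inside] = 1.0  # background
geo = np.stack([grid_y - 3, 5 - grid_y, grid_x - 2, 6 - grid_x]).astype(float)

candidates = decode_pixel_boxes(seg, geo, scale_x=2.0, scale_y=2.0)
word_box = np.median(candidates, axis=0)  # stand-in for the clustering step
print(candidates.shape, word_box)
```

Every word pixel votes for a box, so a page with several words yields several
groups of near-identical candidates. Grouping those candidates and reducing each
group to one box is the role of the clustering step. For the real 224×224 maps,
`scale_x` would be the original width divided by 224, and likewise for
`scale_y`.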

## Training

The model was trained on the IAM Handwriting Database with an 80/20 random
split. The best model was selected by a grid search over three learning rates
(0.0005, 0.001, 0.002) and four batch sizes (16, 32, 64, 128), i.e. 12
configurations, each trained with early stopping (patience=50).
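
As a rough sketch of this search, the grid and the early-stopping rule could
look like the code below. The helper `best_epoch_with_patience` is
hypothetical, and monitoring validation F1 is an assumption; the actual
training scripts may track a different quantity.

```python
from itertools import product

learning_rates = [0.0005, 0.001, 0.002]
batch_sizes = [16, 32, 64, 128]
grid = list(product(learning_rates, batch_sizes))  # 3 x 4 = 12 configurations


def best_epoch_with_patience(val_scores, patience=50):
    """Return (best_epoch, best_score) under early stopping.

    Training stops once `patience` epochs have passed without a new best
    score; later entries of `val_scores` are never looked at.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Toy run with patience=2: training stops before the final 0.9 is reached.
print(len(grid), best_epoch_with_patience([0.1, 0.5, 0.4, 0.3, 0.9], patience=2))
```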

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Batch size | 16 |
| Max epochs | 200 |
| Loss | Cross-entropy (segmentation) + IoU (geometry) |
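
To illustrate the geometry term: when a box is described by the distances from
one pixel to its four edges, the IoU of a predicted and a target box can be
computed directly from those distances. The sketch below uses the `-log(IoU)`
form common in distance-map detectors. Whether training uses exactly this
form, and the distance order (top, bottom, left, right), are assumptions.

```python
import numpy as np


def iou_loss(pred, target, eps=1e-6):
    """Mean -log(IoU) between boxes given as edge distances.

    pred, target: (..., 4) non-negative distances (top, bottom, left, right)
    measured from the same pixel, so both boxes always contain that pixel.
    """
    pt, pb, pl, pr = np.moveaxis(np.asarray(pred, dtype=float), -1, 0)
    tt, tb, tl, tr = np.moveaxis(np.asarray(target, dtype=float), -1, 0)
    inter_h = np.minimum(pt, tt) + np.minimum(pb, tb)
    inter_w = np.minimum(pl, tl) + np.minimum(pr, tr)
    inter = inter_h * inter_w
    union = (pt + pb) * (pl + pr) + (tt + tb) * (tl + tr) - inter
    return float(np.mean(-np.log((inter + eps) / (union + eps))))


target = np.array([1.0, 1.0, 2.0, 2.0])     # 2 high, 4 wide
perfect = iou_loss(target, target)          # IoU = 1
half = iou_loss([1.0, 1.0, 1.0, 1.0], target)  # IoU = 4 / 8
print(perfect, half)
```

A perfect prediction gives a loss of 0, and a prediction covering half of the
target (IoU 0.5) gives a loss of ln 2.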

Full training instructions:
[README](https://github.com/PellelNitram/xournalpp_htr/blob/master/xournalpp_htr/training/word_detector/README.md).

## Intended use

This model is the detection stage in a handwriting recognition pipeline. It is
designed to run on personal devices (laptops, edge) via ONNX Runtime — no GPU
required for inference. A separate transcription model (not yet available)
would read the detected word regions.

## Limitations

- Detection only — no text transcription.
- Grayscale input required.
- Fixed 448×448 resize may distort aspect ratio on non-square images.
- No training-time data augmentation (planned improvement).
- Validated on IAM-style handwriting; performance on other styles (e.g.
  historical documents) may vary.
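
The aspect-ratio point can be made concrete with the scale factors of the
resize. The helpers below are hypothetical and assume a plain stretch to
448×448 without padding, which is how the limitation above reads; the package's
actual preprocessing may differ in detail.

```python
def resize_scales(orig_width, orig_height, target=448):
    """Horizontal and vertical scale factors of a plain stretch to target x target."""
    return target / orig_width, target / orig_height


def to_original(box, orig_width, orig_height, target=448):
    """Map (x_min, y_min, x_max, y_max) from resized to original coordinates."""
    sx, sy = resize_scales(orig_width, orig_height, target)
    x_min, y_min, x_max, y_max = box
    return (x_min / sx, y_min / sy, x_max / sx, y_max / sy)


# An 896x448 page is squeezed to half width but keeps its height, so words
# appear twice as narrow to the model as they are on the page.
sx, sy = resize_scales(896, 448)
restored = to_original((100, 100, 200, 200), 896, 448)
print(sx, sy, restored)  # 0.5 1.0 (200.0, 100.0, 400.0, 200.0)
```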

## Citation

The architecture is based on
[WordDetectorNN](https://github.com/githubharald/WordDetectorNN) by
[Harald Scheidl](https://github.com/githubharald).