Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,98 @@
---
language: en
license: mit
tags:
- word-detection
- handwriting
- onnx
- xournalpp
library_name: onnxruntime
pipeline_tag: object-detection
---

# WordDetector: Word-Level Bounding Box Detection for Handwritten Text

A word-detection model that locates individual handwritten words in document
images. It outputs axis-aligned bounding boxes only; it does not transcribe or
label the words. Part of the
[Xournal++ HTR](https://github.com/PellelNitram/xournalpp_htr) project.

## Model details

| Property | Value |
| --- | --- |
| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
| Input | Grayscale image, resized to 448×448 |
| Output | 7 feature maps at 224×224 (3 segmentation + 4 geometry) |
| Format | ONNX (softmax baked in, opset 17) |
| Parameters | ~11.2M |
| Training data | [IAM Handwriting Database](https://huggingface.co/datasets/PellelNitram/xournalpp_htr_IAM_DB) |
| Best validation F1 | 0.88 (learning rate 0.001, batch size 16, max. 200 epochs) |
| License | MIT |

## Usage

```python
from xournalpp_htr.inference_models import WordDetectorModel

model = WordDetectorModel.from_pretrained()
# grayscale_image: the page to analyse, as a single-channel (grayscale) image
boxes = model.detect(grayscale_image)  # list[BoundingBox]
```

Each `BoundingBox` has `x_min`, `y_min`, `x_max`, `y_max` in the original
image's pixel coordinates.
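
Because the coordinates refer to the original image, a detected box can be used directly to cut out a word region, for example to hand it to a later transcription step. The sketch below is illustrative only: `Box` is a hypothetical stand-in that merely mirrors the four documented fields, the image is assumed to be a row-major list of pixel rows, and `x_max`/`y_max` are assumed to be exclusive bounds, which this model card does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in exposing the four documented fields."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def crop(image, box):
    """Return the part of a row-major 2-D image covered by `box`."""
    height, width = len(image), len(image[0])
    # Round outward and clamp to the image so no ink is cut off.
    x0 = max(0, math.floor(box.x_min))
    y0 = max(0, math.floor(box.y_min))
    x1 = min(width, math.ceil(box.x_max))   # assumed exclusive bound
    y1 = min(height, math.ceil(box.y_max))  # assumed exclusive bound
    return [row[x0:x1] for row in image[y0:y1]]


# Tiny 4x6 "image" whose pixel values encode (row, column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
print(crop(image, Box(1, 1, 4, 3)))  # [[11, 12, 13], [21, 22, 23]]
```

With a real array type the same slice would typically be written as `image[y0:y1, x0:x1]`.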

Install with `pip install xournalpp-htr`, which pulls in `onnxruntime` and
`huggingface-hub`; PyTorch is not needed.

## How it works

The model outputs 7 maps per image:

- **Segmentation** (3 channels): word / surrounding margin / background
  (softmax classification)
- **Geometry** (4 channels): per-pixel distance to the top, bottom, left, and
  right edges of the enclosing word bounding box

Post-processing decodes these maps into bounding boxes via connected-component
analysis and DBSCAN clustering.
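
As a rough illustration of the geometry channels, the sketch below turns the four distances predicted at one map pixel into a box and rescales it to the original image. It assumes that the distances are measured in pixels of the 224×224 output map, that the channels arrive in the order listed above, and that the whole image was stretched to the square input. None of this is confirmed by this model card, and the function names are made up for the example. It also leaves out the connected-component and DBSCAN steps that merge the many per-pixel boxes into one box per word.

```python
def decode_pixel(x, y, top, bottom, left, right):
    """Box implied by one map pixel at (x, y) and its four edge distances."""
    return (x - left, y - top, x + right, y + bottom)


def to_image_coords(box, image_width, image_height, map_size=224):
    """Rescale (x_min, y_min, x_max, y_max) from map to image pixels."""
    # Assumption: the full image was stretched onto the square map,
    # so each axis has its own scale factor.
    sx = image_width / map_size
    sy = image_height / map_size
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


box_on_map = decode_pixel(100, 50, top=10, bottom=6, left=20, right=30)
print(box_on_map)                              # (80, 40, 130, 56)
print(to_image_coords(box_on_map, 1120, 672))  # (400.0, 120.0, 650.0, 168.0)
```

Every pixel classified as "word" yields such a candidate box, which is why a clustering step is needed to consolidate them.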

## Training

Trained on the IAM Handwriting Database with an 80/20 random split. The best
model was selected via a hyperparameter grid search over three learning rates
(0.0005, 0.001, 0.002) and four batch sizes (16, 32, 64, 128), i.e. 12
configurations, each trained with early stopping (patience=50).
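
The search space and the stopping rule can be sketched in a few lines. This is a generic illustration, not the project's training code: `stopping_epoch` is a hypothetical helper, and it assumes the monitored quantity is a validation score where higher is better (such as F1), which this model card does not state explicitly.

```python
from itertools import product

LEARNING_RATES = (0.0005, 0.001, 0.002)
BATCH_SIZES = (16, 32, 64, 128)
PATIENCE = 50

# Every (learning rate, batch size) pair of the grid.
grid = list(product(LEARNING_RATES, BATCH_SIZES))
print(len(grid))  # 12


def stopping_epoch(val_scores, patience=PATIENCE):
    """Return (best_epoch, best_score, epochs_run) for per-epoch scores."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, best_score, epoch + 1
    return best_epoch, best_score, len(val_scores)


print(stopping_epoch([0.1, 0.5, 0.4, 0.3, 0.2], patience=2))  # (1, 0.5, 4)
```

In the toy run the score peaks at epoch 1, two epochs pass without improvement, and training stops after 4 epochs.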

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Batch size | 16 |
| Max epochs | 200 |
| Loss | Cross-entropy (segmentation) + IoU (geometry) |
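
The table names the geometry loss only as "IoU". One common way to build such a loss from per-pixel edge distances is shown below as a sketch. The negative-log form, the `eps` term, and the function names are assumptions made for illustration; the actual formulation, and how the two loss terms are weighted, are defined in the training code linked from this README.

```python
import math


def iou_from_distances(pred, target):
    """IoU of two boxes anchored at the same pixel.

    Each box is given as non-negative (top, bottom, left, right) distances.
    """
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    # Both boxes contain the anchor pixel, so they overlap around it.
    inter_h = min(pt, tt) + min(pb, tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter = inter_h * inter_w
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target, eps=1e-6):
    """Assumed loss form: -log(IoU), close to 0 for a perfect match."""
    return -math.log(iou_from_distances(pred, target) + eps)


print(iou_from_distances((10, 10, 10, 10), (5, 5, 5, 5)))  # 0.25
print(iou_from_distances((3, 4, 5, 6), (3, 4, 5, 6)))      # 1.0
```

Unlike a plain distance regression, this kind of loss is independent of word size, since it depends only on the overlap ratio.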

Full training instructions:
[README](https://github.com/PellelNitram/xournalpp_htr/blob/master/xournalpp_htr/training/word_detector/README.md).

## Intended use

This model is the detection stage of a handwriting recognition pipeline. It is
designed to run on personal devices (laptops, edge devices) via ONNX Runtime,
so no GPU is required for inference. A separate transcription model (not yet
available) would read the detected word regions.

## Limitations

- Detection only: no text transcription.
- Grayscale input required.
- The fixed 448×448 resize may distort the aspect ratio of non-square images.
- No training-time data augmentation (planned improvement).
- Validated on IAM-style handwriting; performance on other styles (e.g.
  historical documents) may vary.
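
To get a feel for the resize limitation, the helper below computes the horizontal and vertical scale factors of a plain stretch to 448×448 and their ratio. It assumes the resize ignores the aspect ratio (no padding or cropping), which is what the limitation suggests but not something this model card spells out; `resize_scales` and `stretch_ratio` are hypothetical names.

```python
def resize_scales(width, height, target=448):
    """Horizontal and vertical scale factors of a stretch to target x target."""
    return target / width, target / height


def stretch_ratio(width, height):
    """How much more one axis is scaled than the other (1.0 = undistorted)."""
    sx, sy = resize_scales(width, height)
    return max(sx, sy) / min(sx, sy)


print(resize_scales(896, 448))  # (0.5, 1.0)
print(stretch_ratio(896, 448))  # 2.0
print(stretch_ratio(448, 448))  # 1.0
```

A portrait page is squeezed more vertically than horizontally under this assumption, so words appear wider relative to their height than in the original scan.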

## Citation

The architecture is based on
[WordDetectorNN](https://github.com/githubharald/WordDetectorNN) by
[Harald Scheidl](https://github.com/githubharald).