---
library_name: mlx
license: mit
tags:
  - mlx
  - handwriting
  - quality
  - image-classification
  - apple-silicon
pipeline_tag: image-classification
datasets:
  - breitburg/penpal
---

# penpal-quality-assurance

A small MLX ResNet that scores a 256×256 grayscale handwriting raster on
`[0, 1]` (after a sigmoid): **1 = legible, human-style handwriting**,
**0 = corrupted or illegible output**. It is trained to filter synthetic
handwriting produced by Graves-style generative models before that output is
used downstream.

- ~36k parameters (channels 4 / 8 / 16 / 32 / 32)
- Single-file safetensors weights
- Apple Silicon / MLX

## Inputs

- Shape `[B, 256, 256, 1]` (MLX NHWC), `float32`
- `0.0` = background, `1.0` = ink
- The renderer in `render.py` (or `graves_handwriting_mlx.quality.render_strokes`)
  fits each stroke bbox isotropically (one scale factor for both axes, so the
  aspect ratio is preserved) into the canvas with 12 px padding
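
As an illustration of the isotropic fit described above, the sketch below maps a set of points into a 256×256 canvas with a 12 px margin using a single scale factor. It is not the code in `render.py`: the helper name `fit_points`, the centering choice, and the flat `(N, 2)` point array are assumptions made for this example.

```python
import numpy as np

CANVAS = 256   # canvas side in pixels (from the input contract)
PADDING = 12   # margin kept free on every side


def fit_points(points, canvas=CANVAS, padding=PADDING):
    """Scale and center (N, 2) x/y points inside the padded canvas.

    Hypothetical helper: one scale factor is shared by both axes,
    so the aspect ratio of the bounding box is preserved.
    """
    points = np.asarray(points, dtype=np.float32)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    extent = np.maximum(hi - lo, 1e-6)          # avoid division by zero
    usable = canvas - 2 * padding               # 232 px at the defaults
    scale = usable / extent.max()               # limited by the longer side
    offset = (canvas - (hi - lo) * scale) / 2   # center the box on each axis
    return (points - lo) * scale + offset


# A 100 x 25 bounding box: x fills the usable width, y is centered.
pts = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 25.0]])
fitted = fit_points(pts)
```

Because the scale is set by the longer side of the bounding box, wide inputs end up with small glyphs.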

## Output

Raw logits. Apply `mx.sigmoid` for a probability in `[0, 1]`.

## Usage

With the `graves-handwriting-mlx` package installed:

```python
import mlx.core as mx
from graves_handwriting_mlx.quality import QualityClassifier, render_strokes

model = QualityClassifier.from_pretrained("breitburg/penpal-quality-assurance")

# `strokes` is the project's nested word -> stroke -> point schema
image = render_strokes(strokes)                           # [256, 256, 1]
score = mx.sigmoid(model(mx.array(image)[None])).item()   # Python float in [0, 1]
```

Without the package, download the weights directly:

```python
from huggingface_hub import hf_hub_download
weights_path = hf_hub_download("breitburg/penpal-quality-assurance", "weights.safetensors")
```

## Training data

Real (label `1.0`) and corrupted-synthetic (label `0.0`) strokes are
rasterized through the same renderer so the classifier cannot use
rendering style as a shortcut.

- **Positive** — real human handwriting strokes (IAM-OnDB-derived
  collections)
- **Negative** — strokes generated by the Graves model with internal
  state corruption applied during sampling (attention `κ` scale,
  attention `β` floor, hidden-state Gaussian noise) in a 10 / 70 / 20
  mixture of *very mild / mild / gibberish* corruption ranges
- **Mid (label `0.5`)** — clean samples from
  [`breitburg/penpal`](https://huggingface.co/datasets/breitburg/penpal),
  which sit between the real and corrupted clusters

Loss is BCE-with-logits over the soft `{0.0, 0.5, 1.0}` labels.
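
For reference, BCE-with-logits for a logit `z` and a soft target `y` is `-(y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z)))`. The NumPy sketch below uses the standard numerically stable rearrangement of that formula. It illustrates the loss only and is not the training code; the function name is an assumption.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, valid for soft targets in [0, 1].

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)), which avoids overflow
    for large positive or negative logits.
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    per_item = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_item.mean())


loss_mid = bce_with_logits([0.0], [0.5])   # target 0.5, logit 0
loss_off = bce_with_logits([1.0], [0.5])   # same target, logit moved away from 0
```

With a target of `0.5` the loss is minimized at logit `0` (a sigmoid score of `0.5`), where it equals `ln 2 ≈ 0.693`; any other logit gives a larger value.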

## Evaluation

Distribution of sigmoid scores on 500 random rows from each source. The `≥t`
columns give the percentage of rows scoring at least `t`:

| Source | Mean | Median | p10 | p25 | p75 | p90 | ≥0.3 | ≥0.5 | ≥0.7 | ≥0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| held-out real handwriting | 0.675 | 0.669 | 0.390 | 0.500 | 0.881 | 0.969 | 96.4 % | 75.0 % | 46.0 % | 22.8 % |
| `breitburg/penpal` (clean synthetic) | 0.418 | 0.396 | 0.321 | 0.352 | 0.452 | 0.529 | 100 % | 13.8 % | 3.0 % | 0.6 % |

The lowest-scoring penpal rows are genuinely degraded; the
highest-scoring rows look indistinguishable from real handwriting. A residual
length / scale bias exists (longer texts render smaller and tend to
score lower) — acceptable for filtering, but worth knowing.
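
A row of the table above can be reproduced from an array of sigmoid scores with a few NumPy calls. This is a sketch under assumptions: the helper name `summarize` is hypothetical, the percentiles use NumPy's default linear interpolation (the method behind the reported numbers is not stated), and the demo scores are made up.

```python
import numpy as np


def summarize(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Return mean, median, selected percentiles and share-above-threshold (%)."""
    s = np.asarray(scores, dtype=np.float64)
    row = {"mean": float(s.mean()), "median": float(np.median(s))}
    for q in (10, 25, 75, 90):
        row[f"p{q}"] = float(np.percentile(s, q))
    for t in thresholds:
        # Percentage of rows whose score is at least t.
        row[f">={t}"] = 100.0 * float(np.mean(s >= t))
    return row


demo = summarize([0.2, 0.4, 0.6, 0.8])  # made-up scores, for illustration only
```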

## Suggested thresholds

- `0.3` (lenient): keeps all 500 sampled penpal rows and 96.4 % of real; meant to drop only obvious failures
- `0.5` (balanced): drops ~86 % of penpal, keeps 75 % of real
- `0.7` (strict): keeps only confidently human-looking rows (~46 % of real, 3 % of penpal)
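
Because the sigmoid is monotonic, thresholding the probability at `t` is equivalent to thresholding the raw logit at `log(t / (1 - t))`, so a filter can work directly on the model output. The pure-Python sketch below shows this; the helper names `logit_threshold` and `keep_mask` are hypothetical and the logits are made up.

```python
import math


def logit_threshold(p):
    """Logit z such that sigmoid(z) == p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def keep_mask(logits, threshold=0.5):
    """True for every sample whose sigmoid score is at least `threshold`."""
    cut = logit_threshold(threshold)
    return [z >= cut for z in logits]


# Strict filtering at 0.7 corresponds to a logit cut-off of about 0.847.
mask = keep_mask([-2.0, 0.1, 1.5], threshold=0.7)
```

At the balanced threshold of `0.5` the cut-off is simply logit `0`.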

## Files

- `weights.safetensors` — trained parameters
- `config.json` — architecture widths and input contract
- `model.py` — `QualityClassifier` / `BasicBlock` reference implementation
- `render.py` — `render_strokes` for stroke → 256×256 raster

## License

MIT.