---
library_name: mlx
license: mit
tags:
- mlx
- handwriting
- quality
- image-classification
- apple-silicon
pipeline_tag: image-classification
datasets:
- breitburg/penpal
---

# penpal-quality-assurance

A small MLX ResNet that scores a 256×256 grayscale handwriting raster on `[0, 1]`: **1 = legible, human-style handwriting**, **0 = corrupted or illegible output**. Trained to filter synthetic handwriting produced by Graves-style generative models before it's used downstream.

- ~36k parameters (channels 4 / 8 / 16 / 32 / 32)
- Single-file safetensors weights
- Apple Silicon / MLX

## Inputs

- Shape `[B, 256, 256, 1]` (MLX NHWC), `float32`
- `0.0` = background, `1.0` = ink
- The renderer in `render.py` (or `graves_handwriting_mlx.quality.render_strokes`) fits each stroke bbox isotropically into the canvas with 12 px padding

## Output

Raw logits. Apply `mx.sigmoid` for a probability in `[0, 1]`.

## Usage

With the `graves-handwriting-mlx` package installed:

```python
import mlx.core as mx
from graves_handwriting_mlx.quality import QualityClassifier, render_strokes

model = QualityClassifier.from_pretrained("breitburg/penpal-quality-assurance")

# `strokes` is the project's nested word -> stroke -> point schema
image = render_strokes(strokes)                       # [256, 256, 1]
score = mx.sigmoid(model(mx.array(image)[None]))[0]   # float in [0, 1]
```

Without the package, download the weights directly:

```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("breitburg/penpal-quality-assurance", "weights.safetensors")
```

## Training data

Real (label `1.0`) and corrupted-synthetic (label `0.0`) strokes are rasterized through the same renderer so the classifier cannot use rendering style as a shortcut.
- **Positive**: real human handwriting strokes (IAM-OnDB-derived collections)
- **Negative**: strokes generated by the Graves model with internal state corruption applied during sampling (attention `κ` scale, attention `β` floor, hidden-state Gaussian noise) in a 10 / 70 / 20 mixture of *very mild / mild / gibberish* corruption ranges
- **Mid (label `0.5`)**: clean samples from [`breitburg/penpal`](https://huggingface.co/datasets/breitburg/penpal), which sit between the real and corrupted clusters

Loss is BCE-with-logits over the soft `{0.0, 0.5, 1.0}` labels.

## Evaluation

Distribution of scores on 500 random rows from each source:

| Source | Mean | Median | p10 | p25 | p75 | p90 | ≥0.3 | ≥0.5 | ≥0.7 | ≥0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| held-out real handwriting | 0.675 | 0.669 | 0.390 | 0.500 | 0.881 | 0.969 | 96.4 % | 75.0 % | 46.0 % | 22.8 % |
| `breitburg/penpal` (clean synthetic) | 0.418 | 0.396 | 0.321 | 0.352 | 0.452 | 0.529 | 100 % | 13.8 % | 3.0 % | 0.6 % |

The lowest-scoring penpal rows are genuinely degraded; the highest-scoring rows look indistinguishable from real handwriting. A residual length / scale bias exists: longer texts render smaller and tend to score lower. This is acceptable for filtering, but worth knowing.

## Suggested thresholds

- `0.3`, lenient: keeps essentially all of penpal, drops only the obvious failures
- `0.5`, balanced: drops ~86 % of penpal, keeps 75 % of real
- `0.7`, strict: keeps only confidently human-looking rows (~46 % of real)

## Files

- `weights.safetensors`: trained parameters
- `config.json`: architecture widths and input contract
- `model.py`: `QualityClassifier` / `BasicBlock` reference implementation
- `render.py`: `render_strokes` for stroke → 256×256 raster

## License

MIT.
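
## Example: applying a threshold

The suggested thresholds can be applied to a batch of raw logits roughly as follows. This is a minimal sketch, not part of the `graves-handwriting-mlx` package: the names `THRESHOLDS`, `sigmoid`, and `filter_by_quality` are hypothetical, and the code uses plain Python floats so it runs without MLX. With MLX arrays you would apply `mx.sigmoid` to the model output instead.

```python
import math

# Threshold values copied from the "Suggested thresholds" section of this card.
# The dictionary itself is a hypothetical helper, not shipped with the model.
THRESHOLDS = {"lenient": 0.3, "balanced": 0.5, "strict": 0.7}


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in [0, 1] (numerically stable form)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def filter_by_quality(logits, mode="balanced"):
    """Return the indices of samples whose score meets the chosen threshold."""
    threshold = THRESHOLDS[mode]
    return [i for i, logit in enumerate(logits) if sigmoid(logit) >= threshold]


# Three made-up logits: clearly bad, borderline, clearly good.
example_logits = [-2.0, 0.1, 2.0]
kept_balanced = filter_by_quality(example_logits, "balanced")
kept_strict = filter_by_quality(example_logits, "strict")
```

The model emits logits rather than probabilities, so the sigmoid has to be applied before comparing against `0.3`, `0.5`, or `0.7`.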