Koshur Diacritizer ByT5 Small

A ByT5-small model fine-tuned for Kashmiri/Koshur diacritic restoration: it maps non-diacritic Kashmiri text to diacritic Kashmiri text. The average reviewer-rated accuracy of the model is approximately 77.5%. That is a reasonable score for a first model on a low-resource diacritization task: the model captures most patterns but still has room to improve on edge cases and truncation issues.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the byte-level tokenizer and the fine-tuned model from the Hub.
repo_id = "Omarrran/koshur-diacritizer-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

text = "کاشر زبان"
inputs = tokenizer(text, return_tensors="pt")
# ByT5 works on UTF-8 bytes, so max_new_tokens caps the output length in bytes, not characters.
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
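The training configuration in this card lists a maximum length of 256 bytes, so long inputs may be cut off. One possible workaround, shown here only as a minimal sketch, is to split long text at whitespace into chunks that fit a byte budget and diacritize each chunk separately. The helper name `chunk_by_bytes` and the 250-byte default are assumptions for illustration and are not part of the released code.

```python
def chunk_by_bytes(text, max_bytes=250):
    """Split text at whitespace into chunks of at most max_bytes UTF-8 bytes.

    Hypothetical helper: a single word longer than max_bytes is kept as its
    own (over-budget) chunk rather than being cut in the middle.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        n = len(word.encode("utf-8"))
        extra = n + (1 if current else 0)  # +1 byte for the joining space
        if current and size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, size = [word], n
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed through the tokenizer and `model.generate` as in the usage snippet, and the outputs joined with spaces. Note that `str.split()` collapses runs of whitespace, so the original spacing and line breaks are not preserved.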

Final metrics

{
  "validation": {
    "loss": 0.061068128794431686,
    "der_marked": 0.10005247507433969,
    "der_all": 0.03764245052568145,
    "wer": 0.1231150319412455,
    "exact_match": 0.24492979719188768,
    "runtime": 156.3775,
    "samples_per_second": 8.198,
    "steps_per_second": 0.262,
    "epoch": 9.992486851990984
  },
  "test": {
    "der_marked": 0.2011514510633298,
    "der_all": 0.14687684306471502,
    "wer": 0.21588209414870216,
    "exact_match": 0.12782608695652173,
    "n_sentences": 1150,
    "n_units": 93255,
    "n_marked": 17022
  }
}

Generated training report

Kashmiri Diacritic Restoration — Run byt5small-ksdiac-extra-20260613

1. Method

We cast diacritic restoration as byte-level sequence-to-sequence transduction and fine-tune the latest released model (Omarrran/koshur-diacritizer-byt5-small). The retained training checkpoint is training-checkpoints/checkpoint-6650. The extra-dataset run was initialized from an earlier trained model, but only the final retained checkpoint is kept in the Hub repo. Byte-level modelling avoids subword tokenisers, which can corrupt Perso-Arabic combining marks. The input is the un-diacritised (bare) skeleton; the target is the fully diacritised form. At inference, a skeleton guard rejects any output that alters the consonant skeleton, so the model can only add marks.
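The skeleton guard described above might look roughly like the following sketch. It is not the run's actual implementation: it assumes that "diacritics" means Unicode nonspacing marks (category `Mn`) plus the letter fold listed in the data section, and that a rejected output falls back to the unchanged input. The names `LETTER_FOLD`, `bare_skeleton` and `guard` are hypothetical, and the code points are a transcription of the fold as printed in this report.

```python
import unicodedata

# Letter fold transcribed from the "Learned letter fold" entry (assumed code points).
LETTER_FOLD = {
    "\u0672": "\u0627",  # alef with wavy hamza above -> alef
    "\u0624": "\u0648",  # waw with hamza above -> waw
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0626": "\u06cc",  # yeh with hamza above -> farsi yeh
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u06c2": "\u06c1",  # heh goal with hamza above -> heh goal
    "\u06d3": "\u06d2",  # yeh barree with hamza above -> yeh barree
    "\u0673": "\u0627",  # alef with wavy hamza below -> alef
    "\u0671": "\u0627",  # alef wasla -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
}

def bare_skeleton(text):
    """Fold marked letter variants to their base letter and drop combining marks."""
    folded = "".join(LETTER_FOLD.get(ch, ch) for ch in text)
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Mn")

def guard(source, prediction):
    """Accept the prediction only if its skeleton equals the source skeleton.

    Assumption: on rejection we return the source unchanged.
    """
    if bare_skeleton(prediction) == bare_skeleton(source):
        return prediction
    return source
```

In use, `guard(text, decoded_output)` would wrap the decoded model output, so a prediction that swaps, drops or inserts a letter is never returned.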

2. Data understanding

  • Column diacritic densities: {'input_text': 0.0, 'target_text': 0.13561}
  • Chosen input column: input_text; target column: target_text
  • Learned letter fold (10 entries): {'ٲ': 'ا', 'ؤ': 'و', 'آ': 'ا', 'ئ': 'ی', 'أ': 'ا', 'ۂ': 'ہ', 'ۓ': 'ے', 'ٳ': 'ا', 'ٱ': 'ا', 'إ': 'ا'}
  • Alignment survival: 82.1% (kept 23727/28891; misaligned=16, dup=1974, len=3174)
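The three drop counts account for every removed pair: 16 + 1974 + 3174 = 5164 = 28891 - 23727. A simplified version of such a filter is sketched below. It assumes that "misaligned" means the target's bare form differs from the input, that "dup" means an exact repeated pair, and that "len" means the target exceeds the 256-byte limit; the order of the checks and the function names are also assumptions. For brevity it strips only combining marks and omits the letter fold, which a real filter would need to apply as well.

```python
import unicodedata
from collections import Counter

def strip_marks(text):
    """Remove Unicode nonspacing marks (category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def filter_pairs(pairs, max_bytes=256):
    """Keep aligned, unique, length-bounded (input, target) pairs.

    Returns the kept pairs and a Counter of drop reasons.
    """
    kept, dropped, seen = [], Counter(), set()
    for src, tgt in pairs:
        if strip_marks(tgt) != strip_marks(src):
            dropped["misaligned"] += 1
        elif (src, tgt) in seen:
            dropped["dup"] += 1
        elif len(tgt.encode("utf-8")) > max_bytes:
            dropped["len"] += 1
        else:
            seen.add((src, tgt))
            kept.append((src, tgt))
    return kept, dropped
```

The survival rate is then `len(kept) / len(pairs)`, which for the counts reported above is 23727 / 28891, or about 82.1%.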

Split statistics

| split | rows | mean chars | p95 chars | diacritic density |
| --- | --- | --- | --- | --- |
| train | 21295 | 113.6 | 191 | 0.1243 |
| validation | 1282 | 116.3 | 193 | 0.1246 |
| test | 1150 | 114.7 | 193 | 0.1237 |

3. Training configuration

| setting | value |
| --- | --- |
| model | Omarrran/koshur-diacritizer-byt5-small; retained checkpoint: training-checkpoints/checkpoint-6650 |
| epochs | 10.0 |
| lr | 0.0005 |
| effective batch | 32 |
| precision | bf16 |
| scheduler | cosine |
| max len (bytes) | 256 |
| GPU | NVIDIA L4 |
| torch | 2.4.1+cu121 |
| transformers | 4.44.2 |
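As a sanity check, the checkpoint number 6650 and the logged epoch of about 9.9925 can be reproduced from the settings above. The sketch below assumes the run used the Hugging Face Trainer's usual step accounting (batches per epoch floor-divided by the gradient accumulation steps) and a hypothetical split of the effective batch of 32 into 16 per device with 2 accumulation steps; the report does not state the actual split.

```python
import math

def optimizer_steps(train_rows, per_device_batch, grad_accum, epochs):
    """Return (total optimizer steps, epoch value logged at the last step)."""
    batches_per_epoch = math.ceil(train_rows / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = math.ceil(epochs * updates_per_epoch)
    logged_epoch = total_updates * grad_accum / batches_per_epoch
    return total_updates, logged_epoch

# 21295 training rows and 10 epochs come from the report;
# 16 x 2 is an assumed split of the effective batch of 32.
total_updates, logged_epoch = optimizer_steps(21295, 16, 2, 10)
print(total_updates, round(logged_epoch, 4))  # 6650 9.9925
```

Under these assumptions the run ends at step 6650 slightly before ten full epochs, because the last partial accumulation window of each epoch is dropped from the step count.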

4. Results

| metric | validation | test |
| --- | --- | --- |
| DER (marked) | 0.1001 | 0.2012 |
| DER (all) | 0.0376 | 0.1469 |
| WER | 0.1231 | 0.2159 |
| Exact match | 0.2449 | 0.1278 |

Test set: 1150 sentences, 93255 letters (17022 diacritised).

DER (marked) — the headline metric — is the error rate over letters that carry a diacritic in the reference. Lower is better.
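To make the definition concrete, here is a minimal sketch of how the two DER variants could be computed for one sentence pair. It is an illustration and not the evaluation script used for this report: it assumes a "letter" is any non-space character that is not a combining mark, that a letter "carries a diacritic" when combining marks follow it, and that DER (all) is the same error rate over all letters. Precomposed letter variants such as those in the letter fold are treated as plain letters here, which the real metric may handle differently. The function names are hypothetical.

```python
import unicodedata

def split_units(text):
    """Group each base letter with the combining marks that follow it."""
    units = []
    for ch in text:
        if ch.isspace():
            continue
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks + (ch,))
        else:
            units.append((ch, ()))
    # Sort marks so that mark order does not count as an error.
    return [(base, tuple(sorted(marks))) for base, marks in units]

def der(reference, prediction):
    """Return DER over marked reference letters and over all letters."""
    ref, hyp = split_units(reference), split_units(prediction)
    if len(ref) != len(hyp):
        raise ValueError("reference and prediction have different letter counts")
    errors_all = sum(r != h for r, h in zip(ref, hyp))
    marked = [(r, h) for r, h in zip(ref, hyp) if r[1]]
    errors_marked = sum(r != h for r, h in marked)
    return {
        "der_marked": errors_marked / len(marked) if marked else 0.0,
        "der_all": errors_all / len(ref) if ref else 0.0,
    }
```

For a reference with four letters, two of them marked, a prediction that misses one mark scores 0.5 on DER (marked) and 0.25 on DER (all). This is why DER (marked) is the stricter of the two figures in the results table.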

6. Qualitative samples

[Image: qualitative samples]

7. Reproducibility

All artefacts are in this directory: run_config.json, dataset_stats.json, history.jsonl, metrics.json, predictions.jsonl, confusion.json, and model/.

Evaluation results

  • Test DER-marked on Combined Kashmiri Non-Diacritic to Diacritic Parallel Dataset
    self-reported
    0.2012 (lower is better)
  • Human review eval on Combined Kashmiri Non-Diacritic to Diacritic Parallel Dataset
    self-reported
    77.6%