Koshur Diacritizer ByT5 Small

A ByT5-small model fine-tuned for Kashmiri/Koshur diacritic restoration: non-diacritic Kashmiri text → diacritic Kashmiri text. The average reviewer-rated accuracy of the model is approximately 77.5%. That is a reasonable score for a first model on a low-resource diacritization task: the model captures most patterns but still has room to improve on edge cases and truncation issues.

Usage

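from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the byte-level tokenizer and the fine-tuned seq2seq model from the Hub.
repo_id = "Omarrran/koshur-diacritizer-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

# Input is bare (non-diacritic) Kashmiri text.
text = "کاشر زبان"
inputs = tokenizer(text, return_tensors="pt")
# ByT5 emits one token per byte, so this caps the output at 256 bytes.
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))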

Final metrics

{
  "validation": {
    "loss": 0.061068128794431686,
    "der_marked": 0.10005247507433969,
    "der_all": 0.03764245052568145,
    "wer": 0.1231150319412455,
    "exact_match": 0.24492979719188768,
    "runtime": 156.3775,
    "samples_per_second": 8.198,
    "steps_per_second": 0.262,
    "epoch": 9.992486851990984
  },
  "test": {
    "der_marked": 0.2011514510633298,
    "der_all": 0.14687684306471502,
    "wer": 0.21588209414870216,
    "exact_match": 0.12782608695652173,
    "n_sentences": 1150,
    "n_units": 93255,
    "n_marked": 17022
  }
}

Generated training report

Kashmiri Diacritic Restoration — Run `byt5small-ksdiac-extra-20260613`

1. Method

We cast diacritic restoration as byte-level sequence-to-sequence transduction. This extra-dataset run starts from the latest released model, itself an earlier trained model, and fine-tunes it further; only the final retained checkpoint, training-checkpoints/checkpoint-6650, is kept in the Hub repo. Byte-level modelling avoids subword tokenisers that corrupt Perso-Arabic combining marks. The input is the un-diacritised (bare) skeleton; the target is the fully diacritised form. At inference, a skeleton guard rejects any output that alters the consonant skeleton, so the model can only add marks.
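The skeleton guard described above could look roughly like the following. This is a minimal sketch, not the run's actual implementation: the function names `skeleton` and `guard` are hypothetical, the fold table is transcribed as code points from the "Learned letter fold" line of this report, treating every Unicode nonspacing mark (category `Mn`) as a diacritic is an assumption, and falling back to the unchanged input on rejection is also an assumption, since the report only says the output is rejected.

```python
import unicodedata

# Letter fold transcribed from this report (marked letter -> bare letter).
LETTER_FOLD = dict(zip(
    "\u0672\u0624\u0622\u0626\u0623\u06c2\u06d3\u0673\u0671\u0625",
    "\u0627\u0648\u0627\u06cc\u0627\u06c1\u06d2\u0627\u0627\u0627",
))


def skeleton(text: str) -> str:
    """Drop combining marks and fold marked letters to their bare form."""
    bare = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":  # assumed: all Mn marks are diacritics
            continue
        bare.append(LETTER_FOLD.get(ch, ch))
    return "".join(bare)


def guard(source: str, prediction: str) -> str:
    """Keep the prediction only if it leaves the skeleton untouched.

    Returning the source on rejection is an assumption of this sketch.
    """
    if skeleton(prediction) == skeleton(source):
        return prediction
    return source


source = "\u06a9\u0627\u0634\u0631"            # bare input
good = "\u06a9\u0672\u0634\u064f\u0631"        # same letters, marks added
bad = "\u06a9\u0627\u0631"                     # a letter was dropped
print(guard(source, good) == good)
print(guard(source, bad) == source)
```

Comparing skeletons rather than raw strings is what lets the guard accept added marks while still catching inserted, deleted, or substituted letters.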

2. Data understanding

Column diacritic densities: {'input_text': 0.0, 'target_text': 0.13561}
Chosen input column: input_text; target column: target_text
Learned letter fold (10 entries): {'ٲ': 'ا', 'ؤ': 'و', 'آ': 'ا', 'ئ': 'ی', 'أ': 'ا', 'ۂ': 'ہ', 'ۓ': 'ے', 'ٳ': 'ا', 'ٱ': 'ا', 'إ': 'ا'}
Alignment survival: 82.1% (kept 23727/28891; misaligned=16, dup=1974, len=3174)

Split statistics

split	rows	mean chars	p95 chars	diac. density
train	21295	113.6	191	0.1243
validation	1282	116.3	193	0.1246
test	1150	114.7	193	0.1237
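The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that `misaligned`, `dup`, and `len` are three disjoint rejection reasons; reading `dup` as duplicates and `len` as length-filtered rows is an interpretation, since the report does not spell the names out.

```python
# Figures copied from the alignment line and the split table of this report.
total_pairs = 28891
kept = 23727
dropped = {"misaligned": 16, "dup": 1974, "len": 3174}
split_rows = {"train": 21295, "validation": 1282, "test": 1150}

# Every dropped pair should be accounted for by exactly one reason.
assert total_pairs - sum(dropped.values()) == kept
# The three splits should partition the kept pairs.
assert sum(split_rows.values()) == kept

survival = kept / total_pairs
print(f"alignment survival: {survival:.1%}")
for name, rows in split_rows.items():
    print(f"{name}: {rows / kept:.1%} of kept pairs")
```

Both assertions hold for the reported numbers, and the survival rate rounds to the 82.1% stated above.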

3. Training configuration

setting	value
model	Omarrran/koshur-diacritizer-byt5-small; retained checkpoint: training-checkpoints/checkpoint-6650
epochs	10.0
lr	0.0005
effective batch	32
precision	bf16
scheduler	cosine
max len (bytes)	256
GPU	NVIDIA L4
torch	2.4.1+cu121
transformers	4.44.2
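Because the maximum length is 256 bytes and the diacritised output is longer than its bare input, long passages may need to be split before inference. The helper below is a hedged sketch of one way to do that: `chunk_by_bytes` is a hypothetical name, and the default budget of 200 bytes is an arbitrary margin chosen for illustration, not a value taken from this run.

```python
def chunk_by_bytes(text: str, budget: int = 200) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    `budget` UTF-8 bytes. Runs of whitespace collapse to single spaces,
    and a single word longer than the budget becomes its own chunk."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        n = len(word.encode("utf-8"))
        extra = n + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > budget:
            chunks.append(" ".join(current))
            current, size = [word], n
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Each Perso-Arabic letter here takes 2 bytes in UTF-8, so each word is 8 bytes.
sample = "\u06a9\u0627\u0634\u0631 \u0632\u0628\u0627\u0646"
for piece in chunk_by_bytes(sample, budget=10):
    print(len(piece.encode("utf-8")), piece)
```

Each chunk can then be passed through `model.generate` separately and the outputs joined with spaces. Splitting on whitespace keeps words intact, which matters because a diacritic depends on the letters around it.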

4. Results

metric	validation	test
DER (marked)	0.1001	0.2012
DER (all)	0.0376	0.1469
WER	0.1231	0.2159
Exact match	0.2449	0.1278

Test set: 1150 sentences, 93255 letters (17022 diacritised).

DER (marked), the headline metric, is the diacritic error rate over letters that carry a diacritic in the reference; DER (all) is the same rate over all letters. Lower is better.
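To make the two DER variants concrete, here is a small sketch of how they could be computed. It is an illustration under stated assumptions, not the evaluation code of this run: `units` and `der` are hypothetical names, a letter's diacritic label is taken to be its folded-away precomposed form plus any following nonspacing marks (Unicode category `Mn`), and the fold table is transcribed as code points from the data section of this report.

```python
import unicodedata

# Letter fold transcribed from this report (marked letter -> bare letter).
LETTER_FOLD = dict(zip(
    "\u0672\u0624\u0622\u0626\u0623\u06c2\u06d3\u0673\u0671\u0625",
    "\u0627\u0648\u0627\u06cc\u0627\u06c1\u06d2\u0627\u0627\u0627",
))


def units(text):
    """Split text into (bare letter, diacritic label) pairs; non-letters are skipped."""
    result = []
    attachable = False  # True while the previous character belongs to a letter
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            if attachable:
                result[-1][1] += ch
        elif ch.isalpha():
            label = ch if ch in LETTER_FOLD else ""
            result.append([LETTER_FOLD.get(ch, ch), label])
            attachable = True
        else:
            attachable = False
    return [tuple(u) for u in result]


def der(reference, hypothesis):
    """Return (DER over marked reference letters, DER over all letters)."""
    ref, hyp = units(reference), units(hypothesis)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and hypothesis skeletons differ")
    wrong = [r != h for r, h in zip(ref, hyp)]
    marked = [w for w, (_, label) in zip(wrong, ref) if label]
    der_all = sum(wrong) / len(wrong) if wrong else 0.0
    der_marked = sum(marked) / len(marked) if marked else 0.0
    return der_marked, der_all


reference = "\u06a9\u0672\u0634\u064f\u0631"   # 4 letters, 2 of them marked
hypothesis = "\u06a9\u0672\u0634\u0631"        # one mark missing
print(der(reference, hypothesis))
```

In this toy pair one of the two marked letters is wrong, so DER (marked) is 0.5, while one of four letters overall is wrong, so DER (all) is 0.25. The sketch compares labels as raw strings, so two marks on the same letter in a different order would count as an error.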

6. Qualitative samples

7. Reproducibility

All artefacts are in this directory: run_config.json, dataset_stats.json, history.jsonl, metrics.json, predictions.jsonl, confusion.json, and model/.

Model tree for Omarrran/koshur-diacritizer-byt5-small

Base model: google/byt5-small (this model is a fine-tune)

Evaluation results (self-reported, on the Combined Kashmiri Non-Diacritic to Diacritic Parallel Dataset)

metric	value
Test DER (marked)	0.2012 (lower is better)
Human review	77.6%