---
language: ['ar']
tags:
- diacritization
- nlp
- arabic
metrics:
- DER
- WER
- SER
---

# Automatic Restoration of Diacritics for Speech Data Sets

This is a transformer-based model for Arabic text diacritization, as described [here](https://github.com/rufaelfekadu/Diac.git).

## Evaluation Results

### Evaluation on clartts

DER (Diacritic Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 10.33% | 8.45% |
| **Excluding no diacritic** | 12.72% | 10.33% |

WER (Word Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 30.16% | 19.71% |
| **Excluding no diacritic** | 29.91% | 19.60% |

## How to Use

### Installation

```bash
git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .
```

### Loading the Model

```python
from diac.models import DiacritizationModule

model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-only-tashkeela",
    tokenizer_constants_path="constants/"  # Path to constants directory
)
```

### Running Inference

```python
# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt"
)

# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
```

### Running Evaluation

To evaluate the model on your own test set:

1. **Run inference** to generate predictions:

   ```bash
   python inference.py \
       --config configs/.yml \
       --opts \
       DATA.TEST_PATH path/to/test.txt \
       INFERENCE.MODEL_PATH \
       INFERENCE.OUTPUT_PATH path/to/predictions.txt
   ```

2. **Prepare reference file** (if needed):

   ```bash
   python src/diac/utils/prep_ref.py \
       --input_file path/to/test.txt \
       -o path/to/output_dir
   ```

3. **Calculate metrics** (DER, WER, SER):

   ```bash
   python src/diac/utils/eval.py \
       -ofp path/to/predictions.txt \
       -tfp path/to/reference.txt \
       --style Fadel
   ```

The evaluation script reports DER, WER, and SER under each combination of the following settings:

- With/without case ending
- Including/excluding no diacritic
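
To make the four configurations concrete, here is a minimal sketch of how DER and WER could be computed for a single reference/prediction pair. It is not the repository's `eval.py`: the names `split_word` and `error_rates` are hypothetical, "case ending" is simplified to "the last letter of each word", only the eight basic Arabic diacritics are recognized, and SER is omitted. Treat the results as illustrative rather than comparable to the tables above.

```python
# Basic Arabic diacritics: tanween (3), short vowels (3), shadda, sukun.
DIACRITICS = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def split_word(word):
    """Return a list of (base_char, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, prediction, case_ending=True, include_no_diacritic=True):
    """Return (DER, WER) as fractions in [0, 1] for two aligned strings.

    Assumption: both strings contain the same words and base letters,
    and differ only in their diacritics.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction differ in word count")

    char_total = char_errors = word_total = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs, pred_pairs = split_word(ref_word), split_word(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base letters differ between reference and prediction")
        if not case_ending:
            # Simplification: ignore the final letter of every word.
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]

        word_has_error = False
        for (_, ref_marks), (_, pred_marks) in zip(ref_pairs, pred_pairs):
            if not include_no_diacritic and not ref_marks:
                continue  # skip letters that carry no diacritic in the reference
            char_total += 1
            # Sort the marks so that shadda+vowel order does not matter.
            if sorted(ref_marks) != sorted(pred_marks):
                char_errors += 1
                word_has_error = True
        word_total += 1
        if word_has_error:
            word_errors += 1

    der = char_errors / char_total if char_total else 0.0
    wer = word_errors / word_total if word_total else 0.0
    return der, wer


if __name__ == "__main__":
    reference = "\u0643\u064e\u062a\u064e\u0628\u064e"   # kataba
    prediction = "\u0643\u064e\u062a\u064e\u0628\u064f"  # katabu (wrong final vowel)
    print(error_rates(reference, prediction))
    print(error_rates(reference, prediction, case_ending=False))
```

In this example the only mistake is on the final letter, so it counts as one error out of three letters with case endings and disappears when case endings are ignored. This matches the pattern in the tables above, where the "without case ending" rates are lower than the "with case ending" rates.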