---
language: ['ar']
tags:
- diacritization
- nlp
- arabic
- transformer
metrics:
- DER
- WER
- SER
---

# Automatic Restoration of Diacritics for Speech Data Sets

This is a transformer-based model for Arabic text diacritization, as described in the [Diac repository](https://github.com/rufaelfekadu/Diac.git).

## Evaluation Results

### Evaluation on ClArTTS

DER (Diacritic Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 10.33% | 8.45% |
| **Excluding no diacritic** | 12.72% | 10.33% |

WER (Word Error Rate)

| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 30.16% | 19.71% |
| **Excluding no diacritic** | 29.91% | 19.60% |

## How to Use

### Installation

```bash
git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .
```

### Loading the Model

```python
from diac.models import DiacritizationModule

model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-only-tashkeela",
    tokenizer_constants_path="constants/"  # Path to constants directory
)
```

### Running Inference

```python
# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt"
)

# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
```

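
The inference calls above presumably expect undiacritized input. If your source text already carries partial diacritics, or you want to derive model input from a diacritized reference, a minimal sketch like the following could strip the marks first. The helper name `strip_diacritics` and the choice of the Unicode range U+064B to U+0652 are assumptions made for illustration; they are not part of the `diac` package.

```python
import re

# Assumption: "diacritics" means the Arabic tashkeel marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun). Other marks are left untouched.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving base letters and spacing as-is."""
    return DIACRITICS.sub("", text)


if __name__ == "__main__":
    # Hypothetical usage: clean a diacritized phrase before inference.
    print(strip_diacritics("مَرْحَبًا بِكَ"))
```

The cleaned string can then be passed to `model.predict_text`, or written line by line to the file given to `model.predict_file`.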
### Running Evaluation

To evaluate the model on your own test set:

1. **Run inference** to generate predictions:

   ```bash
   python inference.py \
       --config configs/<model>.yml \
       --opts \
       DATA.TEST_PATH path/to/test.txt \
       INFERENCE.MODEL_PATH <path_to_checkpoint> \
       INFERENCE.OUTPUT_PATH path/to/predictions.txt
   ```

2. **Prepare reference file** (if needed):

   ```bash
   python src/diac/utils/prep_ref.py \
       --input_file path/to/test.txt \
       -o path/to/output_dir
   ```

3. **Calculate metrics** (DER, WER, SER):

   ```bash
   python src/diac/utils/eval.py \
       -ofp path/to/predictions.txt \
       -tfp path/to/reference.txt \
       --style Fadel
   ```

The evaluation script reports DER, WER, and SER for each combination of two settings: with or without case endings, and including or excluding letters that carry no diacritic.

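
To make these configurations concrete, here is a minimal sketch of how DER and WER could be computed under each setting. It is not the repository's `eval.py`. The function names and parameters are hypothetical, and the sketch rests on several assumptions: both texts share the same base letters and whitespace tokenization, every non-diacritic character counts as a base letter, the case ending is the diacritic on the last letter of each word, and "excluding no diacritic" skips letters that are bare in the reference.

```python
# Assumption: the diacritic inventory is the tashkeel range U+064B..U+0652.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")


def split_word(word):
    """Return (base_char, marks) pairs; marks are sorted so order is ignored."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:  # attach the mark to the preceding base character
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def error_rates(reference, prediction, case_ending=True, count_no_diacritic=True):
    """Return (DER, WER) in percent for two aligned, diacritized strings."""
    char_total = char_errors = word_total = word_errors = 0
    for ref_word, pred_word in zip(reference.split(), prediction.split()):
        ref_pairs, pred_pairs = split_word(ref_word), split_word(pred_word)
        if not case_ending:
            # Drop the final letter of each word, which carries the case ending.
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        word_wrong = False
        for (_, ref_marks), (_, pred_marks) in zip(ref_pairs, pred_pairs):
            if not count_no_diacritic and not ref_marks:
                continue  # skip letters with no diacritic in the reference
            char_total += 1
            if ref_marks != pred_marks:
                char_errors += 1
                word_wrong = True
        word_total += 1
        word_errors += word_wrong
    der = 100.0 * char_errors / max(char_total, 1)
    wer = 100.0 * word_errors / max(word_total, 1)
    return der, wer


if __name__ == "__main__":
    ref = "كَتَبَ الوَلَدُ"
    pred = "كَتَبَ الوَلَدَ"  # differs from ref only in the final case ending
    for ce in (True, False):
        for nd in (True, False):
            print(ce, nd, error_rates(ref, pred, case_ending=ce, count_no_diacritic=nd))
```

In this toy pair the only mistake sits on a case ending, so both rates drop to zero once case endings are ignored, while excluding bare letters shrinks the DER denominator and raises the rate.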