---
language: ['ar']
tags:
- diacritization
- nlp
- arabic
- transformer
metrics:
- DER
- WER
- SER
---
# Automatic Restoration of Diacritics for Speech Data Sets
This is a transformer-based model for Arabic text diacritization, as described [here](https://github.com/rufaelfekadu/Diac.git).
## Evaluation Results
### Evaluation on ClArTTS
DER (Diacritic Error Rate)
| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 10.33% | 8.45% |
| **Excluding no diacritic** | 12.72% | 10.33% |
WER (Word Error Rate)
| Configuration | With case ending | Without case ending |
|---|---|---|
| **Including no diacritic** | 30.16% | 19.71% |
| **Excluding no diacritic** | 29.91% | 19.60% |
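
To make the four table cells concrete, here is a minimal sketch of how a DER figure can be computed under each setting. It is not the repository's `eval.py`. It rests on these assumptions: diacritics are the Unicode marks U+064B to U+0652, the "case ending" is the diacritic on the last character of each word, "excluding no diacritic" skips characters that carry no diacritic in the reference, and reference and prediction share the same base characters. The function names `split_marks` and `der` are hypothetical.

```python
# Assumption: Arabic diacritics are U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_marks(word):
    """Return one frozenset of diacritics per base character of `word`."""
    marks = []
    for ch in word:
        if ch in DIACRITICS:
            if marks:  # attach the mark to the preceding base character
                marks[-1] = marks[-1] | {ch}
        else:
            marks.append(frozenset())
    return marks


def der(reference, prediction, with_case_ending=True, include_no_diacritic=True):
    """Diacritic Error Rate in percent (illustrative definition, see assumptions)."""
    errors = total = 0
    for ref_word, pred_word in zip(reference.split(), prediction.split()):
        ref, pred = split_marks(ref_word), split_marks(pred_word)
        if not with_case_ending:
            # Ignore the last character of every word.
            ref, pred = ref[:-1], pred[:-1]
        for ref_marks, pred_marks in zip(ref, pred):
            if not include_no_diacritic and not ref_marks:
                continue  # skip characters with no diacritic in the reference
            total += 1
            errors += ref_marks != pred_marks
    return 100.0 * errors / total if total else 0.0


reference = "كَتَبَ الوَلَدُ"
prediction = "كَتَبَ الوَلَدَ"  # differs only in the final case ending
for case_ending in (True, False):
    for include_none in (True, False):
        score = der(reference, prediction, case_ending, include_none)
        print(f"case_ending={case_ending} include_no_diacritic={include_none}: {score:.2f}%")
```

Comparing diacritics as sets means that shadda followed by a vowel and the reverse order count as equal. The actual evaluation script may treat ordering, punctuation, and non-Arabic characters differently.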
## How to Use
### Installation
```bash
git clone https://github.com/rufaelfekadu/diac.git
cd diac
pip install -e .
```
### Loading the Model
```python
from diac.models import DiacritizationModule

model = DiacritizationModule.from_pretrained(
    "rufaelfekadu/diac-transformer-text-only-tashkeela",
    tokenizer_constants_path="constants/"  # Path to constants directory
)
```
### Running Inference
```python
# Predict diacritization for a text file
model.predict_file(
    input_file="path/to/input.txt",
    output_file="path/to/output.txt"
)

# Or predict for a single text string
diacritized_text = model.predict_text("مرحبا بك")
```
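
If your input text already carries partial diacritics, it may be worth stripping them before inference so the model sees plain text. Whether `predict_text` and `predict_file` do this internally is not stated here, so the helper below is a sketch under that assumption. The name `strip_diacritics` is hypothetical, and the character range (U+064B to U+0652) is an assumption that leaves out rarer marks such as the dagger alif (U+0670).

```python
import re

# Assumption: diacritics are U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
_DIACRITICS_RE = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving base letters and spacing untouched."""
    return _DIACRITICS_RE.sub("", text)


plain = strip_diacritics("مَرْحَبًا بِكَ")
print(plain)
```

The resulting plain string can then be passed to `model.predict_text` as shown above.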
### Running Evaluation
To evaluate the model on your own test set:
1. **Run inference** to generate predictions:
```bash
python inference.py \
    --config configs/<model>.yml \
    --opts \
    DATA.TEST_PATH path/to/test.txt \
    INFERENCE.MODEL_PATH <path_to_checkpoint> \
    INFERENCE.OUTPUT_PATH path/to/predictions.txt
```
2. **Prepare reference file** (if needed):
```bash
python src/diac/utils/prep_ref.py \
    --input_file path/to/test.txt \
    -o path/to/output_dir
```
3. **Calculate metrics** (DER, WER, SER):
```bash
python src/diac/utils/eval.py \
    -ofp path/to/predictions.txt \
    -tfp path/to/reference.txt \
    --style Fadel
```
The evaluation script reports DER, WER, and SER for each combination of the following settings:
- With/without case ending
- Including/excluding no diacritic
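
As a rough illustration of how the word-level and sentence-level metrics relate across these settings, here is a minimal sketch. It is not the repository's `eval.py`, and the `--style Fadel` implementation may differ in details. The assumptions: a word counts as wrong if any considered character has different diacritics from the reference, a sentence counts as wrong if it contains at least one wrong word, diacritics are U+064B to U+0652, and reference and prediction lines are aligned word for word. The function names are hypothetical.

```python
# Assumption: Arabic diacritics are U+064B..U+0652.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def letter_marks(word):
    """Return one frozenset of diacritics per base character of `word`."""
    marks = []
    for ch in word:
        if ch in DIACRITICS:
            if marks:
                marks[-1] = marks[-1] | {ch}
        else:
            marks.append(frozenset())
    return marks


def word_is_wrong(ref_word, pred_word, with_case_ending, include_no_diacritic):
    """True if any considered character differs between reference and prediction."""
    ref, pred = letter_marks(ref_word), letter_marks(pred_word)
    if not with_case_ending:
        ref, pred = ref[:-1], pred[:-1]  # ignore the last character of the word
    return any(r != p for r, p in zip(ref, pred) if include_no_diacritic or r)


def wer_ser(ref_lines, pred_lines, with_case_ending=True, include_no_diacritic=True):
    """Return (WER, SER) in percent under the illustrative definitions above."""
    words = word_errors = sentences = sentence_errors = 0
    for ref_line, pred_line in zip(ref_lines, pred_lines):
        flags = [
            word_is_wrong(r, p, with_case_ending, include_no_diacritic)
            for r, p in zip(ref_line.split(), pred_line.split())
        ]
        words += len(flags)
        word_errors += sum(flags)
        sentences += 1
        sentence_errors += any(flags)
    wer = 100.0 * word_errors / words if words else 0.0
    ser = 100.0 * sentence_errors / sentences if sentences else 0.0
    return wer, ser


refs = ["كَتَبَ الوَلَدُ", "مَنْ"]
preds = ["كَتَبَ الوَلَدَ", "مَنْ"]
for case_ending in (True, False):
    for include_none in (True, False):
        wer, ser = wer_ser(refs, preds, case_ending, include_none)
        print(f"case_ending={case_ending} include_no_diacritic={include_none}: "
              f"WER={wer:.2f}% SER={ser:.2f}%")
```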