---
license: apache-2.0
library_name: transformers
tags:
- translation
- multilingual
- ancient-languages
- akkadian
language:
- akk
- en
model-index:
- name: AKK-300m
  results:
  - task:
      type: translation
      name: "Akkadian (Cuneiform) → English (Latin)"
    metrics:
    - name: bleu
      type: bleu
      value: 70.35
  - task:
      type: translation
      name: "Akkadian (Transliteration) → English (Latin)"
    metrics:
    - name: bleu
      type: bleu
      value: 73.18
  - task:
      type: transliteration
      name: "Akkadian (Cuneiform) → Akkadian (Transliteration)"
    metrics:
    - name: bleu
      type: bleu
      value: 85.43
  - task:
      type: translation
      name: "English (Latin) → Akkadian (Transliteration)"
    metrics:
    - name: bleu
      type: bleu
      value: 41.80
  - task:
      type: translation
      name: "English (Latin) → Akkadian (Cuneiform)"
    metrics:
    - name: bleu
      type: bleu
      value: 45.23
---

# AKK-300m

Introducing AKK-300m, a model capable of handling a range of cuneiform translation, transliteration, and correction tasks.

## 1. Model description

This is an instruct model: the task is selected by the instruction that prefixes the input, so a single model handles multiple tasks. It is intended primarily for translation and transliteration, but it can also be used for reverse translation (English to Akkadian).
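
Each task is requested by prepending one of the instruction strings listed in the subsections below to the input text. The following is a minimal sketch of that pattern, not part of the model or of `transformers`: the names `INSTRUCTIONS` and `build_prompt` and the short task keys are hypothetical, and the `": "` separator is an assumption taken from the usage snippet in Section 2.

```python
# Hypothetical helper for building prompts. Only the instruction strings
# themselves come from this model card; the dictionary keys and function
# name are illustrative.
INSTRUCTIONS = {
    "cuneiform_to_english": "Translate Akkadian cuneiform to English",
    "english_to_cuneiform": "Translate English to Akkadian cuneiform",
    "cuneiform_to_simple": "Transliterate Akkadian cuneiform to simple Latin Characters",
    "group_words": "Group Akkadian transliteration into likely words",
}


def build_prompt(task: str, text: str) -> str:
    """Return the instruction for `task` followed by `text`.

    Assumption: instruction and input are joined with ": ",
    as in the usage snippet of this card.
    """
    if task not in INSTRUCTIONS:
        raise KeyError(f"Unknown task {task!r}; choose from {sorted(INSTRUCTIONS)}")
    return f"{INSTRUCTIONS[task]}: {text}"


if __name__ == "__main__":
    # The resulting string is what would be passed to the tokenizer.
    print(build_prompt("cuneiform_to_english", "𒄨 𒃼 𒁺"))
```

Keeping the instruction strings in one place avoids typos in the prefix, which matters because the model was trained on these exact wordings.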

### Translation Instructions:

* "Translate Akkadian cuneiform to English"
    + cuneiform signs → English
* "Translate complex Akkadian transliteration to English"
    + complex transliteration → English
* "Translate Akkadian simple transliteration to English"
    + simple transliteration → English
* "Translate Akkadian grouped transliteration to English"
    + transliteration with special symbols → English
* "Translate English to Akkadian cuneiform"
    + English → Akkadian cuneiform signs
* "Translate English to simple Akkadian transliteration"
    + English → Akkadian simple transliteration with no special symbols
* "Translate English to grouped Akkadian transliteration"
    + English → Akkadian transliteration grouped into words with special symbols

### Transliteration Instructions:

* "Transliterate Akkadian cuneiform to simple Latin Characters"
    + cuneiform signs → transliteration with no special symbols
* "Transliterate Akkadian cuneiform to grouped Latin characters"
    + cuneiform signs → transliteration with special symbols/subscripts
* "Group Akkadian transliteration into likely words"
    + simple transliteration → transliteration with special symbols/subscripts

### Missing Sign Instructions:

* 'Identify the missing signs: '
    + string of Akkadian cuneiform, transliterations

### Base model

This is a finetuned version of [Google's umt5-small](https://huggingface.co/google/umt5-small).

## 2. Usage (code snippet)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "Thalesian/AKK-300m"

# use_fast=False selects the slow (SentencePiece) tokenizer.
# Add local_files_only=True to load from the local cache only (offline use).
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# 1) Prepare your cuneiform input
prompt = "Translate Akkadian cuneiform to English: "
input_text = "𒄨 𒃼 𒁺 𒊭 𒀸 𒌅 𒆰 𒋾 𒀸 𒋩 𒂗 𒋙 𒆰 𒆳 𒆷 𒈠 𒄀 𒊑 𒋗 𒁶 𒋻 𒁁 𒋾 𒌑 𒁖 𒆥 𒄣 𒀀 𒁍 𒄫 𒄑 𒁍 𒉡 𒈠 𒍣 𒆥 𒆧 𒅎 𒉡 𒌋 "

# 2) Tokenize & get model outputs
inputs = tokenizer(prompt + input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)

# 3) Decode prediction
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Reference:", "young man valiant who through help assur lord all of not submissive one like a pottery bowl crush minutely like a flood flatten as nothing count")
print("Prediction:", prediction)
```

## 3. Training and evaluation data

Data came from the [Akkademia project](https://github.com/gaigutherz/Akkademia), previously published in [PNAS Nexus](https://academic.oup.com/pnasnexus/article/2/5/pgad096/7147349). Additional data for pre-training and training came from [CDLI Akkadian](https://www.cdli.earth) data. More information on the training data, as well as the test and validation splits, can be found in both the GitHub repository and the published methodology.

### Training procedure

The model was trained in 5 tranches with different datasets and collators. The datasets were:

* a pretraining dataset (transliterations only) of CDLI transliterated data (389,834 lines) and Akkademia + CDLI translated data (126,649 lines)
* a training dataset which included Akkademia and CDLI (126,649 lines)

Three collation methods were used:

* pretraining collation, which introduces an asterisk to represent missing signs
* missing sign translations, which randomly introduces an asterisk to represent missing signs
* translation error, which randomly introduces the wrong sign into input data to simulate transliteration or glyph error

### Final stage training hyperparameters

The following hyperparameters were used during training:

* learning_rate: 5e-05
* train_batch_size: 128
* eval_batch_size: 128
* seed: 42
* distributed_type: multi-GPU
* num_devices: 1
* total_train_batch_size: 128
* total_eval_batch_size: 128
* optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
* lr_scheduler_type: linear
* lr_scheduler_warmup_steps: 5000
* num_epochs: 200

### Framework versions

* Transformers 4.50.3
* PyTorch 2.6.0+cu126
* Datasets 3.3.0
* Tokenizers 0.21.1

## 4.1 Metrics by Line

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
| --- | --- | --- | --- | --- | --- | --- |
| Akkadian | Transliteration | Akkadian | Cuneiform | 95.63 | 95.22 | - |
| Akkadian | Cuneiform | English | Latin | 70.35 | 79.37 | 0.74 |
| Akkadian | Transliteration | English | Latin | 73.18 | 81.79 | 0.76 |
| English | Latin | Akkadian | Cuneiform | 45.23 | 45.24 | - |
| English | Latin | Akkadian | Transliteration | 41.80 | 63.69 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 85.42 | 93.23 | - |

## 4.2 Metrics by Document

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
| --- | --- | --- | --- | --- | --- | --- |
| Akkadian | Transliteration | Akkadian | Cuneiform | 26.41 | 38.55 | - |
| Akkadian | Cuneiform | English | Latin | 27.42 | 47.01 | 0.43 |
| Akkadian | Transliteration | English | Latin | 29.19 | 48.68 | 0.44 |
| English | Latin | Akkadian | Cuneiform | 14.02 | 20.74 | - |
| English | Latin | Akkadian | Transliteration | 14.84 | 31.46 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 25.36 | 40.35 | - |

## 5. Intended uses

* Short Akkadian lines, transliteration pipelines, reverse lookup experiments.

## 6. Limitations

* The context window is only 64 tokens.

## 7. How to Cite

```bibtex
@misc{drake2025akk300m,
  title        = {{AKK-300m}: A UMT5-Small for Akkadian⇄English},
  author       = {Drake, B. Lee},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Thalesian/AKK-300m}}
}
```