🌽 MayaVoice — Machine Translation for Mayan Languages of Guatemala

MayaVoice is a machine translation system between Spanish and 14 Mayan languages of Guatemala. It is the first open-source large language model fine-tuned specifically for this linguistic family, designed to bridge the communication gap between Spanish-speaking services and the 6+ million Maya speakers across Guatemala.

Motivation

Guatemala is a multilingual country that officially recognizes 22 Mayan languages, 14 of which MayaVoice covers. Yet the language barrier between Spanish and these languages severely limits access to healthcare, education, justice, and government services for millions of Maya speakers. MayaVoice aims to narrow this gap through AI-powered translation.

Supported Languages and Evaluation

The model was evaluated on a held-out test set of 300 sentences unseen during training:

Language      Branch        Speakers (est.)    N   BLEU   chrF
Mam           Mamean                530,000   70  47.07  61.30
Kaqchikel     K'ichean              450,000   57  41.34  56.29
Chuj          Q'anjob'alan           65,000   11  31.32  54.78
Q'eqchi'      K'ichean              800,000   22  27.25  52.03
Q'anjob'al    Q'anjob'alan          170,000   23  23.38  43.92
K'iche'       K'ichean            1,100,000   18  21.32  38.57
Tektiteko     Mamean                  5,000   20  15.76  43.78
Sipakapense   K'ichean                8,000    7  15.75  48.87
Awakateko     Mamean                 20,000   12  15.46  34.46
Poqomchi'     K'ichean              115,000    7  15.15  31.88
Poqomam       K'ichean               50,000   24  11.21  48.97
Achi          K'ichean              150,000    4   8.30  44.68
Itza'         Yucatecan               1,000    7   4.02  17.21
Tz'utujil     K'ichean               90,000   18   2.86  28.81
Weighted avg                          ~3.5M  300  40.28  55.07

N is the number of test sentences for each language.

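Note on metrics: BLEU scores word-level n-gram overlap with the reference, while chrF is an F-score over character n-grams. chrF is better suited to agglutinative languages such as those of the Mayan family, because it gives partial credit when a long, morphologically complex word is only partly correct. Languages with a smaller N have wider confidence intervals.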
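The difference between the two metrics can be illustrated with a minimal sketch. The code below is a simplified chrF-style score written for illustration only. It is not the evaluation script used for MayaVoice, and it may differ in detail from reference implementations such as sacreBLEU. The function names and the example strings are hypothetical placeholders, not real Mayan text.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces (a simplifying choice)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score on a 0-100 scale.

    Precision and recall are averaged over character n-gram orders 1..max_n,
    then combined into an F-beta score (beta=2 weights recall more heavily).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def word_overlap(hypothesis: str, reference: str) -> float:
    """Clipped unigram precision: the share of hypothesis words found in the reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    total = sum(hyp.values())
    return sum((hyp & ref).values()) / total if total else 0.0


# Placeholder strings standing in for an agglutinative word that lost its suffix.
reference = "prefixrootsuffix word"
hypothesis = "prefixroot word"

print("word-level overlap:", word_overlap(hypothesis, reference))
print("chrF-style score:  ", round(simple_chrf(hypothesis, reference), 2))
```

At the word level, the truncated word counts as a complete miss, whereas the character-level score still credits the shared prefix and root. This is the behavior that makes chrF more forgiving on morphologically rich languages.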

Linguistic Observations

  • Mam (Mamean) and Kaqchikel (K'ichean) show the strongest results (47.07 and 41.34 BLEU), correlating with greater availability of parallel training data. Chuj and Q'eqchi' follow, while the other Mamean languages, Tektiteko and Awakateko, score near 15 BLEU.
  • Itza' (Yucatecan branch) has the lowest chrF (17.21) and the second-lowest BLEU (4.02), consistent with its status as a critically endangered language with very few digitized texts. Tz'utujil has the lowest BLEU (2.86).
  • chrF scores are substantially higher than BLEU across all languages, which is expected for morphologically complex languages where partial word matches capture translation quality more faithfully.
  • Performance varies widely within the K'ichean branch (Kaqchikel 41.34 BLEU vs. Tz'utujil 2.86), which suggests that genetic relatedness alone does not predict translation quality and that data availability is the dominant factor.
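The within-branch variation noted above can be tabulated directly from the evaluation table. The following is a minimal sketch that groups the reported BLEU scores by the grouping in the table's second column and reports the best and worst language in each group. The helper name `bleu_range_by_branch` is hypothetical. The numbers are copied from the table, not recomputed from model outputs.

```python
from collections import defaultdict

# (language, branch, BLEU) copied from the evaluation table.
ROWS = [
    ("Mam", "Mamean", 47.07),
    ("Kaqchikel", "K'ichean", 41.34),
    ("Chuj", "Q'anjob'alan", 31.32),
    ("Q'eqchi'", "K'ichean", 27.25),
    ("Q'anjob'al", "Q'anjob'alan", 23.38),
    ("K'iche'", "K'ichean", 21.32),
    ("Tektiteko", "Mamean", 15.76),
    ("Sipakapense", "K'ichean", 15.75),
    ("Awakateko", "Mamean", 15.46),
    ("Poqomchi'", "K'ichean", 15.15),
    ("Poqomam", "K'ichean", 11.21),
    ("Achi", "K'ichean", 8.30),
    ("Itza'", "Yucatecan", 4.02),
    ("Tz'utujil", "K'ichean", 2.86),
]


def bleu_range_by_branch(rows):
    """Return, per branch, the highest and lowest scoring language and the BLEU spread."""
    by_branch = defaultdict(list)
    for language, branch, bleu in rows:
        by_branch[branch].append((bleu, language))
    summary = {}
    for branch, scores in by_branch.items():
        low_bleu, low_lang = min(scores)
        high_bleu, high_lang = max(scores)
        summary[branch] = {
            "highest": high_lang,
            "lowest": low_lang,
            "spread": round(high_bleu - low_bleu, 2),
        }
    return summary


for branch, stats in bleu_range_by_branch(ROWS).items():
    print(f"{branch}: {stats['highest']} to {stats['lowest']}, spread {stats['spread']} BLEU")
```

A large spread inside a single branch is what one would expect if training data volume, rather than linguistic relatedness, drives the scores. The table alone cannot confirm this, because it does not list training set sizes.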

Demo

Try MayaVoice live: 🌽 MayaVoice Space

Limitations

  • Domain bias: The training corpus has a non-uniform domain distribution, which may affect quality on everyday conversational language.
  • Low-resource languages: Itza', Tz'utujil, and Achi have significantly fewer training resources, reflected in their metrics.
  • Hallucination risk: Like all generative models, MayaVoice may produce plausible-looking but incorrect translations. Human verification is recommended for critical use cases.
  • Limited test set: 300 test sentences is a small evaluation set; per-language metrics with N < 10 (Sipakapense, Poqomchi', Itza', and Achi) are indicative, not conclusive.
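To give a sense of how much test set size affects uncertainty, the sketch below runs a percentile bootstrap on two synthetic score lists of size 7 and 70, the smallest common N and the largest N in the evaluation table. The scores are made-up placeholders, not MayaVoice outputs. The mean of per-sentence scores stands in for a corpus-level metric, which a full evaluation would recompute on every resample. The function name `bootstrap_ci` is hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sentence scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the sentences with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-sentence scores (0-100). The larger set repeats the smaller one,
# so both have the same mean and spread and differ only in size.
small_set = [10, 30, 50, 70, 90, 20, 60]  # N = 7
large_set = small_set * 10                # N = 70

for name, scores in [("N=7", small_set), ("N=70", large_set)]:
    low, high = bootstrap_ci(scores)
    print(f"{name}: 95% CI width = {high - low:.1f}")
```

With identical underlying scores, the interval for the 7-sentence set comes out several times wider than for the 70-sentence set, which is why the small-N rows in the table should be read as rough indications.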

Ethical Use

MayaVoice was developed to help preserve the Mayan languages of Guatemala and make them more accessible. It should not replace human translators where accuracy is critical (legal, medical); it is intended as a support and accessibility tool.

Citation

@misc{regalado2025mayavoice,
  author = {Regalado Cardoso, Daniel},
  title = {MayaVoice: Machine Translation for 14 Mayan Languages of Guatemala},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DanielRegaladoCardoso/mayavoice-llama3.1-8b-lora-v2}
}

Contact

  • Developer: Daniel Regalado Cardoso
  • Institution: University of Miami
  • Demo: MayaVoice Space