Generating a Culturally and Linguistically Adapted Word Similarity Benchmark for Yucatec Maya

Authors

  • Alejandro Molina-Villegas SECIHTI - Centro de Investigación en Ciencias de Información Geoespacial, Yucatan, Mexico https://orcid.org/0000-0001-9398-8844
  • Joel Suro-Villalobos ShogunOS, Mexico
  • Jorge Reyes-Magaña Universidad Autónoma de Yucatán, Mexico
  • Silvia Fernandez-Sabido SECIHTI - Centro de Investigación en Ciencias de Información Geoespacial, Yucatan, Mexico

DOI:

https://doi.org/10.4114/intartif.vol28iss76pp283-300

Keywords:

Yucatec Maya, Low-resource NLP, Word embeddings for Indigenous languages, Swadesh list, Culturally grounded NLP

Abstract

Word embedding models are among the most effective methods for capturing semantic and syntactic relationships between words, and they have enabled significant advances in natural language processing. However, word embeddings for low-resource Indigenous languages such as Yucatec Maya are often unreliable because training data are limited and existing evaluation benchmarks are unsuitable.
In this work, we propose a methodology for constructing reliable word embeddings by adapting the Swadesh List for semantic similarity evaluation. Our approach translates the Swadesh List from a high-resource pivot language into the target language, applies linguistic and cultural filtering, and correlates the similarity scores of pivot-language embeddings from large language models with those of target-language embeddings. Our results show that this method produces reliable and interpretable embeddings for Yucatec Maya. Furthermore, our analysis provides evidence that the choice of evaluation benchmark has a far greater impact on reported performance than hyperparameter optimization.
This approach establishes a framework that could be adapted to improve word embedding generation for other low-resource languages.
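
The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity as the word-pair score and Spearman rank correlation as the agreement measure, neither of which the abstract specifies. The function names, the English pivot words, and the toy vectors are hypothetical and exist only to make the example runnable.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def benchmark_correlation(translations, pivot_vecs, target_vecs):
    """Correlate pairwise similarities in the pivot and target spaces.

    translations maps each retained pivot word to its target-language
    equivalent (the list that survives linguistic and cultural filtering).
    """
    pivot_scores, target_scores = [], []
    for w1, w2 in combinations(sorted(translations), 2):
        pivot_scores.append(cosine(pivot_vecs[w1], pivot_vecs[w2]))
        target_scores.append(
            cosine(target_vecs[translations[w1]], target_vecs[translations[w2]])
        )
    return spearman(pivot_scores, target_scores)


if __name__ == "__main__":
    # Toy data: the vectors are made-up numbers, not real embeddings.
    translations = {"water": "ha'", "fire": "k'áak'", "sun": "k'iin", "moon": "uj"}
    pivot_vecs = {
        "water": [0.9, 0.1, 0.2],
        "fire": [0.1, 0.8, 0.3],
        "sun": [0.2, 0.7, 0.6],
        "moon": [0.3, 0.4, 0.9],
    }
    target_vecs = {
        "ha'": [0.8, 0.2, 0.1],
        "k'áak'": [0.2, 0.9, 0.2],
        "k'iin": [0.1, 0.6, 0.7],
        "uj": [0.4, 0.3, 0.8],
    }
    print(benchmark_correlation(translations, pivot_vecs, target_vecs))
```

Under these assumptions, a higher correlation means that the target-language embeddings order the word pairs more like the pivot-language embeddings do, which is the sense in which the adapted list serves as an evaluation benchmark.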

Published

2025-09-25

How to Cite

Molina-Villegas, A., Suro-Villalobos, J., Reyes-Magaña, J., & Fernandez-Sabido, S. (2025). Generating a Culturally and Linguistically Adapted Word Similarity Benchmark for Yucatec Maya. Inteligencia Artificial, 28(76), 283–300. https://doi.org/10.4114/intartif.vol28iss76pp283-300