Generating a Culturally and Linguistically Adapted Word Similarity Benchmark for Yucatec Maya
DOI:
https://doi.org/10.4114/intartif.vol28iss76pp283-300
Keywords:
Yucatec Maya, Low-resource NLP, Word embeddings for Indigenous languages, Swadesh list, Culturally grounded NLP
Abstract
Word embedding models are among the most effective methods for capturing semantic and syntactic relationships between words, and they have enabled significant advances in natural language processing. However, word embeddings for low-resource Indigenous languages such as Yucatec Maya are often unreliable, because training data is limited and existing evaluation benchmarks are unsuitable.
In this work, we propose a novel methodology for constructing reliable word embeddings by adapting the Swadesh List for semantic similarity evaluation. Our approach involves translating the Swadesh List from a high-resource pivot language into the target language, applying linguistic and cultural filtering, and correlating similarity scores between pivot-language embeddings from large language models and target-language embeddings. Our results demonstrate that this method produces reliable and interpretable embeddings for Yucatec Maya. Furthermore, our analysis provides compelling evidence that the choice of evaluation benchmark has a far greater impact on reported performance than hyperparameter optimization.
This approach establishes a robust new framework with the potential to be adapted for improving word embedding generation in other low-resource languages.
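The evaluation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measure or the correlation statistic, so cosine similarity and Spearman rank correlation are assumed here. The concept names, the placeholder target-language keys (`tgt_water` and so on), the toy vectors, and the function `benchmark_correlation` are all hypothetical, and the filtering step is reduced to dropping concepts that have no accepted translation.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def benchmark_correlation(concepts, translation, pivot_vecs, target_vecs):
    """Correlate pairwise similarities in the pivot and target spaces.

    Concepts without an entry in `translation` are skipped; this stands in
    for the linguistic and cultural filtering step (an assumption).
    """
    kept = [c for c in concepts if c in translation]
    pivot_scores, target_scores = [], []
    for a, b in combinations(kept, 2):
        pivot_scores.append(cosine(pivot_vecs[a], pivot_vecs[b]))
        target_scores.append(
            cosine(target_vecs[translation[a]], target_vecs[translation[b]])
        )
    return spearman(pivot_scores, target_scores)

# Hypothetical toy data: real use would load pivot-language embeddings and
# trained target-language embeddings, plus a reviewed translation table.
concepts = ["water", "fire", "sun", "moon", "snow"]
translation = {  # "snow" is left out to mimic a filtered concept
    "water": "tgt_water", "fire": "tgt_fire",
    "sun": "tgt_sun", "moon": "tgt_moon",
}
pivot_vecs = {
    "water": [1.0, 0.0, 0.2], "fire": [0.0, 1.0, 0.1],
    "sun": [0.1, 0.9, 0.3], "moon": [0.2, 0.3, 1.0],
    "snow": [0.9, 0.1, 0.4],
}
target_vecs = {
    "tgt_water": [0.9, 0.1, 0.1], "tgt_fire": [0.1, 0.8, 0.2],
    "tgt_sun": [0.2, 0.9, 0.4], "tgt_moon": [0.1, 0.4, 0.9],
}

score = benchmark_correlation(concepts, translation, pivot_vecs, target_vecs)
print(f"Spearman correlation over kept pairs: {score:.3f}")
```

Under these assumptions, a higher correlation means the target-language embeddings order the benchmark word pairs more like the pivot-language embeddings do. The correlation is computed only over pairs that survive filtering.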
License
Copyright (c) 2025 Iberamia & The Authors

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Open Access publishing.
Inteligencia Artificial (Ed. IBERAMIA)
ISSN: 1988-3064 (on line).