Generating a Culturally and Linguistically Adapted Word Similarity Benchmark for Yucatec Maya

Alejandro Molina-Villegas; Joel Suro-Villalobos; Jorge Reyes-Magaña; Silvia Fernandez-Sabido

doi:10.4114/intartif.vol28iss76pp283-300

Authors

Alejandro Molina-Villegas SECIHTI - Centro de Investigación en Ciencias de Información Geoespacial, Yucatan, Mexico https://orcid.org/0000-0001-9398-8844
Joel Suro-Villalobos ShogunOS, Mexico
Jorge Reyes-Magaña Universidad Autónoma de Yucatán, Mexico
Silvia Fernandez-Sabido SECIHTI - Centro de Investigación en Ciencias de Información Geoespacial, Yucatan, Mexico

DOI:

https://doi.org/10.4114/intartif.vol28iss76pp283-300

Keywords:

Yucatec Maya, Low-resource NLP, Word embeddings for Indigenous languages, Swadesh list, Culturally grounded NLP

Abstract

In the field of AI, word embedding models have proven to be one of the most effective methods for capturing semantic and syntactic relationships between words, enabling significant advancements in natural language processing. However, word embeddings produced for low-resource Indigenous languages such as Yucatec Maya are often unreliable, because training data is limited and existing evaluation benchmarks are unsuitable.
In this work, we propose a novel methodology for constructing reliable word embeddings by adapting the Swadesh List for semantic similarity evaluation. Our approach involves translating the Swadesh List from a high-resource pivot language into the target language, applying linguistic and cultural filtering, and correlating similarity scores between pivot-language embeddings from large language models and target-language embeddings. Our results demonstrate that this method produces reliable and interpretable embeddings for Yucatec Maya. Furthermore, our analysis provides compelling evidence that the choice of evaluation benchmark has a far greater impact on reported performance than hyperparameter optimization.
This approach establishes a robust new framework with the potential to be adapted for improving word embedding generation in other low-resource languages.
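
The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity and Spearman rank correlation is an assumption (both are common in word similarity evaluation), the function names are hypothetical, the target-language tokens are placeholders rather than real Yucatec Maya words, and the vectors are toy values, not trained embeddings. The input is assumed to be the translated word list after linguistic and cultural filtering.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def benchmark_correlation(translations, pivot_vectors, target_vectors):
    """Correlate pairwise similarities across the two embedding spaces.

    translations: list of (pivot_word, target_word) pairs that survived
    filtering (hypothetical input format).
    """
    pivot_scores, target_scores = [], []
    for (p1, t1), (p2, t2) in combinations(translations, 2):
        pivot_scores.append(cosine(pivot_vectors[p1], pivot_vectors[p2]))
        target_scores.append(cosine(target_vectors[t1], target_vectors[t2]))
    return spearman(pivot_scores, target_scores)


# Toy data: Swadesh-list concepts in the pivot language, placeholder
# tokens for the target language, and made-up 3-dimensional vectors.
translations = [
    ("water", "tgt_water"),
    ("fire", "tgt_fire"),
    ("sun", "tgt_sun"),
    ("moon", "tgt_moon"),
]
pivot_vectors = {
    "water": [1.0, 0.1, 0.0],
    "fire": [0.0, 1.0, 0.2],
    "sun": [0.1, 0.9, 0.5],
    "moon": [0.2, 0.3, 1.0],
}
target_vectors = {
    "tgt_water": [0.9, 0.2, 0.1],
    "tgt_fire": [0.1, 0.8, 0.4],
    "tgt_sun": [0.0, 1.0, 0.3],
    "tgt_moon": [0.3, 0.2, 0.9],
}

rho = benchmark_correlation(translations, pivot_vectors, target_vectors)
print(f"Spearman correlation between similarity scores: {rho:.3f}")
```

Under these assumptions, a higher correlation would indicate that the target-language embeddings order word pairs by similarity in roughly the same way as the pivot-language embeddings do.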

Inteligencia Artificial
An international open access journal.
Edited by Iberamia. e-ISSN: 1988-3064

J. Impact Factor 2024: 3.7 (Q2)
