To What Extent Is LLM Performance on Multiple-Choice Questions Driven by Data Leakage? A Case Study with Contamination-Controlled Spanish Undergraduate Exams

Authors

  • Eva Sánchez Salido, UNED, Spain
  • Adrian Ghajari, UNED, Spain
  • Guillermo Marco, UNED, Spain
  • Julio Gonzalo, UNED, Spain
  • Jesús Abizanda, UNED, Spain
  • Roser Morante, UNED, Spain
  • Alejandro Benito-Santos, UNED, Spain
  • Laura Plaza, UNED, Spain
  • Jorge Carrillo-de-Albornoz, UNED, Spain
  • Víctor Fresno, UNED, Spain
  • Enrique Amigó, UNED, Spain
  • Andrés Fernández García, UNED, Spain

DOI:

https://doi.org/10.4114/intartif.vol29iss77pp131-151

Keywords:

Natural Language Processing, Large Language Models, Evaluation, Data Contamination, Multiple-Choice Questions, Exams, Spanish

Abstract

The performance of Large Language Models (LLMs) on multiple-choice university-level exam benchmarks such as MMLU is often reported as highly competitive; however, such results raise persistent concerns regarding contamination of public datasets, English-centric bias, and over-reliance on aggregate accuracy as the primary evaluation signal. In particular, the widespread public availability of evaluation data makes it difficult to disentangle genuine generalization from memorization of seen content, while offering limited insight into models’ abilities on culturally grounded assessments beyond English. To address these issues, we introduce lunes (Leakage-controlled Undergraduate National Exams of Spain), a new benchmark of 11,881 multiple-choice questions drawn from official final-year undergraduate exams in Spanish, covering 104 courses across 22 degree programs. The dataset has been rigorously verified to exhibit minimal public web exposure through a combination of automated web search and manual inspection, which enables evaluation under minimal contamination conditions in a non-English, country-specific academic setting. Our results show that (i) LLMs retain strong performance on general knowledge and factual questions, even in the absence of web-accessible training data, suggesting that contamination alone does not explain their success on public benchmarks; (ii) however, their performance degrades substantially on culturally grounded and country-specific content, particularly in domains such as Spanish law, economy, and social structure. Remarkably, models consistently perform better on Anglo-centric content than on Spain-specific material even when answering in Spanish, suggesting that the bottleneck lies in culturally grounded knowledge rather than in language skills per se. 
A question-level error analysis further reveals that these failures reflect systematic gaps in local institutional, legal, and geographical knowledge, gaps that aggregate metrics obscure and that persist even for a high-resource language such as Spanish.

Published

2026-04-29

How to Cite

Sánchez Salido, E., Ghajari, A., Marco, G., Gonzalo, J., Abizanda, J., Morante, R., Benito-Santos, A., Plaza, L., Carrillo-de-Albornoz, J., Fresno, V., Amigó, E., & Fernández García, A. (2026). To What Extent Is LLM Performance on Multiple-Choice Questions Driven by Data Leakage? A Case Study with Contamination-Controlled Spanish Undergraduate Exams. Inteligencia Artificial, 29(77), 131–151. https://doi.org/10.4114/intartif.vol29iss77pp131-151