Keywords: benchmark, NLP, LLMs, evaluation, entity linking, knowledge graph, cultural entities
TL;DR: This work introduces a benchmark and evaluation framework that measures how well LLMs understand Latin American entities using knowledge graphs and probing methods, revealing consistent performance gaps relative to other regions.
Abstract: Large Language Models (LLMs) achieve strong results on general knowledge benchmarks, yet their coverage of region-specific entities—particularly from Latin America—remains limited. To address this gap, we propose CHOCLO, an entity-centric methodology for evaluating LLM knowledge of culturally relevant entities in Latin America. The methodology extracts structured facts from domain-specific resources and organizes them into knowledge graphs spanning nine categories, yielding more than 44,000 entities and 130,000 questions. Evaluation proceeds through two complementary strategies. The first computes factual scores using an LLM-as-a-judge as a measure of accuracy. The second trains probing models that predict these scores directly from LLM embeddings, enabling generation-free evaluation. Results consistently show a regional disparity: GPT-5 and GPT-3.5 score markedly lower on Latin American entities than on those from the U.S. and Europe, while models such as Mistral, DeepSeek, and Qwen underperform across all regions. Category-level analysis further reveals that fauna, flora, and traditions are comparatively better represented, whereas public figures and objects show the largest deficits. CHOCLO thus exposes systematic disparities in how LLMs encode Latin American knowledge and provides a step toward culturally inclusive benchmarks that support fairer global evaluation.
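The second evaluation strategy described in the abstract—probing models that predict factual scores directly from LLM embeddings—can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the synthetic embeddings, the Ridge probe, and all variable names are assumptions standing in for the actual entity embeddings and judge-assigned scores.

```python
# Hypothetical sketch of a generation-free probe: a linear model trained to
# predict LLM-as-a-judge factual scores from entity embeddings.
# All data here is synthetic; the paper's real pipeline is not shown.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_entities, dim = 500, 64

# Stand-in for LLM embeddings of cultural entities.
embeddings = rng.normal(size=(n_entities, dim))
# Stand-in for judge-assigned factual scores (here a noisy linear signal).
true_w = rng.normal(size=dim)
scores = embeddings @ true_w + rng.normal(scale=0.1, size=n_entities)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, scores, test_size=0.2, random_state=0
)

# Train the probe on one split, evaluate predictive quality on the other.
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
r2 = probe.score(X_te, y_te)
print(f"held-out R^2: {r2:.2f}")
```

If such a probe generalizes to held-out entities, factual coverage can be estimated from embeddings alone, without running generation or a judge model for every question.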
Primary Area: datasets and benchmarks
Submission Number: 2426