Abstract: Allusion recognition—a task demanding contextual activation of cultural knowledge—serves as a critical test of large language models’ (LLMs) ability to deploy stored information in open-ended, figurative settings. We introduce a framework for evaluating Persian literary allusions through (1) classical poetry annotations and (2) LLM-generated texts embedding allusions in novel contexts. By combining knowledge assessments, multiple-choice tasks, and open-ended recognition, we isolate whether failures stem from knowledge gaps or from activation challenges. Evaluations across 11 LLMs reveal a critical disconnect: while models exhibit strong foundational knowledge and high multiple-choice accuracy, their performance drops significantly in open-ended settings, particularly for indirect references. Reasoning-optimized models generalize better to novel contexts, whereas distilled models show marked degradation in cultural reasoning. This gap indicates that LLMs’ limitations arise not from missing knowledge but from contextual recall failure—an inability to spontaneously activate cultural references without explicit cues. Our work positions allusion recognition as a benchmark for evaluating contextual knowledge deployment, urging training paradigms that bridge factual recall and culturally grounded reasoning.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; NLP datasets; evaluation; datasets for low resource languages
Contribution Types: Data resources
Languages Studied: Persian
Submission Number: 7231