Keywords: LLM benchmarking, LLM reasoning, Indic languages, Crossword puzzles, Multilingual evaluation
TL;DR: A multilingual reasoning benchmark using crossword puzzles to evaluate LLMs' linguistic and cultural generalization capabilities across 19 Indic languages
Abstract: Current benchmarks for evaluating Large Language Model (LLM) reasoning in Indic languages predominantly assess knowledge retrieval and linguistic competence (question answering, summarization, translation, etc.), and they rely on culturally neutral or irrelevant content. As a result, the reasoning abilities of LLMs in these low-resource languages remain largely unexplored. This work introduces Buddhi-Pragati, a novel benchmark framework that employs crossword puzzles to evaluate rule-based reasoning in 19 Indic languages. We present three key contributions: (1) the identification of crossword puzzles as memorization-resistant evaluation instruments for low-resource language reasoning, (2) a corpus of 120,000 culturally authentic clue-answer pairs scored for Indian contextual relevance using a novel embedding-based approach, and (3) a dataset of 4,750 crossword puzzles generated through memetic algorithms optimized for constraint density and cultural authenticity. Our framework implements seven experimental configurations across a configurable set of models, evaluating performance through word accuracy, letter accuracy, and intersection consistency. By grounding the clue-answer corpus in culturally relevant content while maintaining resistance to memorization, Buddhi-Pragati addresses critical gaps in current Indic-language benchmarking. The complete datasets and evaluation harness are publicly available to facilitate reproducible research in multilingual reasoning assessment.
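The abstract states only that clue-answer pairs are "scored for Indian contextual relevance using a novel embedding-based approach" without detailing it. The sketch below shows one plausible reading under illustrative assumptions: cosine similarity between a clue-answer pair and hand-written anchor descriptions of Indian cultural context, using a multilingual sentence encoder. The anchor texts, the model choice, and the max-similarity formulation are all assumptions, not the authors' method.

```python
# Hypothetical embedding-based cultural-relevance scorer; a sketch, not the
# authors' released implementation.
from sentence_transformers import SentenceTransformer
import numpy as np

# A multilingual encoder, assumed here for coverage of Indic languages.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Illustrative anchor descriptions of Indian cultural context (assumed).
ANCHORS = [
    "Indian history, festivals, cuisine, and mythology",
    "Indian geography, cities, rivers, and monuments",
]
anchor_vecs = model.encode(ANCHORS, normalize_embeddings=True)

def cultural_relevance(clue: str, answer: str) -> float:
    """Max cosine similarity between a clue-answer pair and the anchors."""
    vec = model.encode(f"{clue} {answer}", normalize_embeddings=True)
    return float(np.max(anchor_vecs @ vec))
```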
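The three reported metrics (word accuracy, letter accuracy, intersection consistency) admit a straightforward formulation. The following is a minimal sketch, not the authors' evaluation harness; it assumes a solved puzzle is compared slot-by-slot and cell-by-cell against the gold grid, with intersection consistency measuring internal agreement at crossing cells independently of correctness.

```python
# Minimal sketch of the three metrics, under the assumptions stated above.
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) coordinate in the grid

def word_accuracy(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of clue slots whose answer is reproduced exactly."""
    correct = sum(pred.get(slot, "") == ans for slot, ans in gold.items())
    return correct / len(gold)

def letter_accuracy(pred: Dict[Cell, str], gold: Dict[Cell, str]) -> float:
    """Fraction of cells whose letter matches the gold grid."""
    correct = sum(pred.get(cell, "") == ch for cell, ch in gold.items())
    return correct / len(gold)

def intersection_consistency(across: Dict[Cell, str], down: Dict[Cell, str]) -> float:
    """Fraction of crossing cells where the across and down answers agree,
    regardless of whether either matches the gold answer."""
    shared = set(across) & set(down)
    if not shared:
        return 1.0
    agree = sum(across[c] == down[c] for c in shared)
    return agree / len(shared)
```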
Primary Area: datasets and benchmarks
Submission Number: 19901