Keywords: Large Language Models, Reasoning, Multi-choice questions, Italian Proverbs
TL;DR: ProverbIT is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) beyond simple pattern matching.
Abstract: We present ProverbIT, a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) beyond simple pattern matching. While current models demonstrate high proficiency in text generation, their ability to discriminate between plausible but incorrect options remains understudied. ProverbIT addresses this gap through a challenging multiple-choice task focused on Italian proverbs. In this setting, models are given the beginning of a proverb and must select the correct completion from five options. Crucially, four options are always incorrect distractors, making the fifth option, 'None of the others', the only valid answer. This adversarial design forces models to abandon surface-level heuristics and engage in deeper semantic reasoning to actively discard misleading alternatives. To distinguish a lack of knowledge from a failure in discriminative reasoning, we also introduce a generative completion baseline, in which models simply complete the proverb from its initial fragment. The dataset comprises 100 common Italian proverbs, curated and validated by native speakers.
Source: zip
Ceur: pdf
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 12