\section{Challenge: Introduction and Motivation}

% In this section, you will outline the motivation behind the proposed challenge of evaluating large language models (LLMs) in Italian. You will discuss the importance of your challenge and the linguistic and cultural aspects that it considers. This section sets the stage by explaining why your challenge is relevant and necessary for evaluating LLMs in Italian and addressing potential gaps and opportunities in current research.
% Mention any "sister challenge" if applicable, maybe for another language, and also what your expectations are regarding model performance, and why.



The emergence of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has revolutionized the natural language processing landscape across diverse domains~\cite{lewkowycz2022solving}. Yet, while these models exhibit remarkable proficiency in handling sophisticated linguistic phenomena~\cite{chang2024survey}, substantial uncertainty remains regarding their reliability in processing and interpreting culturally embedded linguistic expressions~\cite{fornaciari2024hard}, such as proverbs.
% [ORIGINAL]
% The emergence of Large Language Models (LLMs) has revolutionized the natural language processing landscape across diverse domains, from machine translation and text summarization to code generation and complex reasoning tasks~\cite{lewkowycz2022solving}. While these models demonstrate remarkable capabilities in handling sophisticated linguistic phenomena~\cite{chang2024survey}, significant gaps persist in our comprehension of how these systems process culturally embedded linguistic expressions~\cite{fornaciari2024hard}.
% Proverbs present an interesting testbed for language model evaluation. 
% 
A proverb is a short, commonly known saying: it expresses a general truth, piece of wisdom, or practical advice, often based on common sense or cultural experience. The understanding of proverbs thus represents a key milestone in language proficiency, and access to the individual components of a proverb allows for the investigation of both lexical access issues and deeper semantic mechanisms.

Since proverbs are high-frequency patterns, standard completion tasks often yield high performance. While we assess this generative baseline, we move beyond it by introducing a more complex challenge: discriminative selection. In this multiple-choice setting, the model must not only recognize the pattern but also evaluate and dismiss plausible alternatives.
% This transformation from pattern completion to discriminative reasoning may be insightful to investigate whether models are capable of grasping the underlying meaning of these cultural expressions, or solely rely on statistical co-occurrence patterns.

In this work we introduce \pro to the ``Challenge the Abilities of LAnguage Models in ITAlian'' (CALAMITA) initiative~\cite{attanasio2024calamita, nissim2025challengingabilitieslargelanguage}. \pro is a dataset of multiple-choice questions centered on Italian proverbs, presented at \textit{CLiC-it 2025}~\cite{mensa2025proverbit, clicit-2025}. By manually designing alternative endings for the proverbs, we can systematically categorize error types and patterns. Our findings reveal a significant performance dichotomy: despite demonstrating some familiarity with these proverbs in generative settings, all models exhibit a sharp decline in accuracy when forced to operate within a multiple-choice framework.
% [ORIGINAL - pretty evident is just trimming down and using synonyms to rephrase]
% By manually designing alternative endings for the proverbs, we can systematically examine the types of errors LLMs make and identify common failure patterns. Our investigation shows a striking paradox: while nearly all models possess knowledge of the proverbs in our dataset, performance deteriorates dramatically when moving from auto-completion to multiple-choice selection.



% MAYBE ADD THIS PARAGRAPH
% The contribution of this paper to CALAMITA is x_fold: \textit{i}) we contribute to Italian NLP benchmarks by introducing a dataset that addresses the under-representation of Italian in comprehensive language model evaluation resources~\cite{wu2025bitter}; \textit{ii}) we conduct a thorough evaluation across $x$ models, [describe kind of models], providing comprehensive performance analysis on proverb completion tasks; [conclusion]