\section{Challenge Description}

% This section provides a detailed overview of the specific challenge. You will describe all the tasks involved in the challenge. 

% Expectations from the challenge: why was it interesting to investigate this.


%In this Section, we describe the proposed challenge  for evaluating $4$ different models on the \pro benchmark, followed by an error analysis. 

The challenge consists primarily of a \textit{multiple-choice} task, designed to assess whether models can reason about the correct completion of a proverb rather than rely on surface-level pattern matching. Specifically, given the beginning of a proverb, models are prompted to choose among several alternative endings that are all syntactically well-formed and, to varying degrees, semantically or stylistically plausible, but never correct. The task thus requires the models to actively evaluate and discard these misleading options and to identify `\sans{None of the others}' as the only correct choice. This setting shifts the focus from automatic generation to discriminative reasoning, emphasizing fine-grained semantic control and proverb-level understanding.

To contextualize the results of the main task, we also introduce a generative \textit{completion} task as a baseline. In this evaluation, models are asked to directly complete each proverb given its initial fragment.
This baseline serves to estimate the models' prior familiarity with the proverbs and to determine whether errors in the multiple-choice setting stem from a lack of knowledge rather than from the nature of the selection task itself.


% To introduce the completion task, which in clic-it was described as "ancillary", I would start by writing:
% To introduce a baseline in our analysis...

