
\section{Results, Challenge-Specific Insights and Lessons Learned}
The evaluation results are reported in Table~\ref{tab:results}. Each model was prompted once for each of the 100 samples in \pro, in a zero-shot setting. We used a chat template, no system prompt, and a temperature of 0.

%\url{https://github.com/CALAMITA-AILC/calamita-eval/blob/main/results/README.md}



\begin{table}[ht]
\centering
\begin{tabular}{lcc|cc}
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{Model}}} & \multicolumn{2}{c}{\textbf{Completion Results}} & \multicolumn{2}{c}{\textbf{Multi-Choice Results}} \\ \cline{2-5} 
\multicolumn{1}{c}{} & \textbf{Accuracy} & \textbf{Correct/Total} & \textbf{Accuracy} & \textbf{Correct/Total} \\ \hline
Llama-3.1-70B-Instruct & 0.67 & 67/100 & 0.14 & 14/100 \\ \hline
Llama-3.1-8B-Instruct & 0.20 & 20/100 & 0.03 & 3/100 \\ \hline
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA & 0.13 & 13/100 & 0.05 & 5/100 \\ \hline
Minerva-7B-instruct-v1.0 & 0.13 & 13/100 & 0.00 & 0/100 \\ \hline
\end{tabular}
\caption{Results on both the completion and multi-choice tasks.}
\label{tab:results}
\end{table}


\paragraph{Performance across models.}
Overall performance on the multi-choice task is very low, with Llama 70B being the best-performing model ($0.14$ accuracy). For this larger model, comparing the completion task with the multi-choice task shows a sharp drop in performance (from $0.67$ to $0.14$), indicating that even when the model recognizes a proverb and is able to complete it, it is not competent enough to discard the four wrong completions in the multi-choice setting.
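
To put these figures in perspective, the following back-of-the-envelope computation uses only the Llama 70B values in Table~\ref{tab:results}. The chance level is an illustrative reference point that assumes a guesser picking uniformly among the five options; it is not part of the official evaluation.
\[
\Delta_{\mathrm{abs}} = 0.67 - 0.14 = 0.53, \qquad
\Delta_{\mathrm{rel}} = \frac{0.67 - 0.14}{0.67} \approx 0.79, \qquad
p_{\mathrm{chance}} = \frac{1}{5} = 0.20 .
\]
Under this assumption, even the best multi-choice accuracy ($0.14$) lies below the uniform-guessing level.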

While the performance gap between tasks persists for smaller models, their competence is already limited in the completion baseline. 
A qualitative analysis of the outputs reveals a significant deficiency in instruction following: these models are often unable to adhere to the simple constraint `only answer with a letter'. This verbosity is strictly penalized by the edit distance metric, leading to near-zero accuracy scores.
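
The effect of verbosity on an edit-distance-based score can be illustrated with a minimal sketch. The scoring rule below (\texttt{is\_correct}, with its \texttt{max\_distance} threshold) and the example outputs are hypothetical and are not the official evaluation code; they only show why an answer that wraps the correct letter in extra text is counted as wrong.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_correct(output: str, gold: str, max_distance: int = 0) -> bool:
    # Hypothetical rule (an assumption): an answer counts as correct only if
    # its edit distance from the gold letter does not exceed max_distance.
    return levenshtein(output.strip(), gold) <= max_distance


# Illustrative outputs, not taken from the actual model generations.
concise = "E"
verbose = "The answer is E"

for answer in (concise, verbose):
    print(repr(answer), levenshtein(answer, "E"), is_correct(answer, "E"))
```

In this sketch the verbose answer contains the right letter but is still rejected, because every extra character adds to the distance from the gold label.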
% non native, fine-tuned native (ANITA), natively native (Minerva)

\paragraph{Error analysis.}
Table~\ref{tab:errors} gives a detailed report of the answers chosen by each model. The analysis reveals distinct error patterns and biases among the models. Both Llama 70B and Minerva 7B display a strong tendency to select the `Synonym' distractor (Option B), choosing it 56 and 62 times, respectively. Conversely, Llama 8B exhibits a bias toward the `Assonant' distractor (Option A), selecting it in 43 instances. Additionally, the table highlights a significant instruction-following issue for Minerva 7B, which produced 33 `Invalid' responses, unlike the other models, which strictly adhered to the valid letter format.

\begin{table}[ht]
  \centering
  \label{tab:errors_absolute}
  \begin{tabular}{l c c c c c c}
    \textbf{Model} & \textbf{A} & \textbf{B} & \textbf{C} & \textbf{D} & \textbf{E} & \textbf{Invalid} \\ \hline
    Llama-3.1-70B-Instruct & $10$ & $56$ & $13$ & $7$ & $14$ & $0$ \\ \hline
    Llama-3.1-8B-Instruct & $43$ & $20$ & $20$ & $14$ & $3$  & $0$ \\ \hline
    Minerva-7B-instruct-v1.0 & $5$ & $62$ & $0$ & $0$ & $0$ & $33$ \\ \hline
    LLaMAntino-3-ANITA-8B-Inst-DPO-ITA & $31$ & $21$ & $13$ & $30$ & $5$ & $0$ \\ \hline
\end{tabular}
    \caption{Absolute number of answers chosen by each model in the \pro task. A = Assonant, B = Synonym, C = Inverse, D = Trivial, E = None of the others. E is always the correct answer.}
    \label{tab:errors}    
\end{table}
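
The per-model biases discussed above can be recomputed directly from the counts in Table~\ref{tab:errors}. The following is a minimal sketch: the counts are copied from the table, while the helper \texttt{summarize} and the variable names are our own illustrative choices and not part of the evaluation pipeline.

```python
OPTIONS = ["A", "B", "C", "D", "E", "Invalid"]
DISTRACTORS = ["A", "B", "C", "D"]  # E ("None of the others") is always correct

# Answer counts per model, copied from the error table (order follows OPTIONS).
counts = {
    "Llama-3.1-70B-Instruct": [10, 56, 13, 7, 14, 0],
    "Llama-3.1-8B-Instruct": [43, 20, 20, 14, 3, 0],
    "Minerva-7B-instruct-v1.0": [5, 62, 0, 0, 0, 33],
    "LLaMAntino-3-ANITA-8B-Inst-DPO-ITA": [31, 21, 13, 30, 5, 0],
}


def summarize(row):
    """Return (accuracy, most-chosen distractor, its share, invalid share)."""
    total = sum(row)
    by_option = dict(zip(OPTIONS, row))
    top = max(DISTRACTORS, key=lambda option: by_option[option])
    return (by_option["E"] / total,
            top,
            by_option[top] / total,
            by_option["Invalid"] / total)


for model, row in counts.items():
    accuracy, top, share, invalid = summarize(row)
    print(f"{model}: accuracy={accuracy:.2f}, "
          f"top distractor={top} ({share:.2f}), invalid={invalid:.2f}")
```

Since E is always the correct option, the share of E answers coincides with the multi-choice accuracy reported in Table~\ref{tab:results}.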





% Some qualitative discussion of results; showing a couple of examples if possible

\paragraph{Discussion.}
The drop in performance between the completion and the multi-choice settings suggests that the higher completion accuracy likely stems from surface-level pattern matching rather than deep semantic comprehension. Since the models fail to filter out wrong completions when presented with distractors, it appears that simple language modeling is insufficient and that models with stronger reasoning capabilities may be required to solve this task.

% expected vs unexpected outcomes





% RESULTS COMPLETION
% | Model                                         | Accuracy   | Correct/Total   |
% |-----------------------------------------------|------------|-----------------|
% | meta-llama/Llama-3.1-70B-Instruct             | 0.6700     | 67/100          |
% | meta-llama/Llama-3.1-8B-Instruct              | 0.2000     | 20/100          |
% | sapienzanlp/Minerva-7B-instruct-v1.0          | 0.1300     | 13/100          |
% | swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA | 0.1300     | 13/100          |



% RESULTS MULTI-CHOICE
% | Model                                         | Accuracy | Correct/Total   |
% |-----------------------------------------------|----------|-----------------|
% | meta-llama/Llama-3.1-70B-Instruct             |     0.14 | 14/100          |
% | swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA |     0.05 | 5/100           |
% | meta-llama/Llama-3.1-8B-Instruct              |     0.03 | 3/100           |
% | sapienzanlp/Minerva-7B-instruct-v1.0          |     0    | 0/100           |


  




% DETAILED RESPONSES ON MULTI-CHOICE
% model,A,B,C,D,E,Nessuna,accuracy,total_questions
% meta-llama/Llama-3.1-70B-Instruct,10,56,13,7,14,0,14.0,100
% meta-llama/Llama-3.1-8B-Instruct,43,20,20,14,3,0,3.0,100
% sapienzanlp/Minerva-7B-instruct-v1.0,5,62,0,0,0,33,0.0,100
% swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA,31,21,13,30,5,0,5.0,100


% | Model                                         | A  | B  | C  | D  | E  | Nessuna | Accur | Total Q |
% | meta-llama/Llama-3.1-70B-Instruct             | 10 | 56 | 13 | 7  | 14 | 0       | 14.0  | 100     |
% | meta-llama/Llama-3.1-8B-Instruct              | 43 | 20 | 20 | 14 | 3  | 0       | 3.0   | 100     |
% | sapienzanlp/Minerva-7B-instruct-v1.0          | 5  | 62 | 0  | 0  | 0  | 33      | 0.0   | 100     |
% | swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA | 31 | 21 | 13 | 30 | 5  | 0       | 5.0   | 100     |
