\section{Metrics}


For both tasks we report accuracy, which is computed differently for each task.

\paragraph{Completion Task.} To compute accuracy on the \textit{completion} task automatically, we measure the edit distance\footnote{We use the implementation from \url{https://docs.python.org/3/library/difflib.html}.} between the completion provided by the model and the correct ending of the proverb.
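
A minimal sketch of how such a comparison could be scored with \texttt{difflib} is given below. Note that \texttt{difflib.SequenceMatcher} exposes a similarity ratio in $[0, 1]$ rather than a raw edit distance, so the sketch works with that ratio. The text normalization, the function names, and the \texttt{threshold} value of 0.8 are illustrative assumptions and are not taken from our evaluation code.

```python
from difflib import SequenceMatcher


def completion_similarity(completion: str, reference: str) -> float:
    """Similarity in [0, 1] between a model completion and the reference ending.

    Lowercasing and whitespace collapsing are assumptions made for this sketch.
    """
    a = " ".join(completion.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def completion_accuracy(pairs, threshold=0.8):
    """Fraction of (completion, reference) pairs whose similarity reaches the threshold.

    The threshold is hypothetical; the paper does not state how the distance
    is mapped to a correct/incorrect decision.
    """
    if not pairs:
        return 0.0
    hits = sum(completion_similarity(c, r) >= threshold for c, r in pairs)
    return hits / len(pairs)
```

For instance, \texttt{completion\_accuracy} could be called on a list of (completion, reference) pairs collected from one model, with the threshold tuned on a small manually checked sample.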

\paragraph{Multi-Choice Task.} Accuracy on the \textit{multi-choice} task is defined as the ratio of correctly chosen options (the correct option is always E) to the total number of multi-choice questions. %Also in this task, each prompt was presented to each model three times; if no majority emerged across the three runs, the response was marked as incorrect.
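
The multi-choice accuracy can be sketched as follows. The sketch assumes that each model response has already been reduced to a single option letter; the constant and function names are hypothetical and serve only as illustration.

```python
CORRECT_OPTION = "E"  # the correct option is always E in this task


def multi_choice_accuracy(answers):
    """Fraction of answers that select the correct option.

    `answers` is assumed to be a list of option letters already extracted
    from the model responses (extraction is outside this sketch).
    """
    if not answers:
        return 0.0
    correct = sum(a.strip().upper() == CORRECT_OPTION for a in answers)
    return correct / len(answers)
```

Comparing after stripping whitespace and uppercasing is a design choice of this sketch, so that answers such as ``e'' are counted as selecting option E.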

%For all tasks, a zero-shot prompting strategy was employed and all requests have been sent separately via API, specifically using the OpenRouter unified interface~\cite{openrouter2024}. For all models the temperature was left at the default OpenRouter value of 1.0 since we countered their nondeterministic nature by employing a majority vote.