\section{Limitations}

While \pro serves as a robust benchmark, we also report a few limitations.
%
The dataset is relatively small, consisting of only $100$ proverbs. While this size is sufficient for a challenge-oriented benchmark, it limits the statistical power of the evaluation and may reduce the robustness of the conclusions drawn from the results.
%
The construction of the alternatives in the multi-choice setting was performed manually. While this design choice allows for fine-grained control over the types of alternative endings, it also makes the dataset harder to scale and extend to a larger number of proverbs.
%
Finally, we observe that our metrics require precise instruction following, especially when testing smaller models: string-matching metrics such as edit distance severely penalize models that fail to suppress conversational filler and return only the requested output.
%Moreover, the proverbs included in the dataset are commonly used expressions in everyday Italian. As a consequence, it is plausible that at least some of them were already present in the training data of the evaluated language models. Although the multi-choice setting mitigates this issue by requiring models to reject plausible but incorrect alternatives, the possibility of training data overlap cannot be entirely ruled out.
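
To make the size of this penalty concrete, consider an illustrative case in which the edit distance is the Levenshtein distance $d$ normalized by the length of the longer string (an assumption made here for illustration; the normalization used in the evaluation may differ). If a model outputs the correct ending $r$ preceded by a filler prefix $f$, then $d(fr, r) = |f|$: deleting the prefix suffices, and at least $|f|$ edits are needed to account for the difference in length. Hence
\[
  \frac{d(fr, r)}{\max(|fr|, |r|)} = \frac{|f|}{|f| + |r|}.
\]
With hypothetical lengths $|f| = 60$ and $|r| = 20$ characters, the normalized distance is $0.75$ even though the ending itself is reproduced exactly.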

% Dimensions of the models? Results obtained with larger models show much higher accuracy.