%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Minted listings support 
%% Need pygment <http://pygments.org/> <http://pypi.python.org/pypi/Pygments>
\usepackage{listings}
\usepackage{url}
\usepackage{array}
\usepackage{booktabs}
\usepackage{subcaption}
\usepackage{todonotes}
\newcommand{\rowstrut}{\rule{0pt}{2.6ex}}
%% auto break lines
\lstset{breaklines=true}

\usepackage{tcolorbox}
\usepackage{xcolor}

\newcommand{\quotebox}[1]
{
  \begin{center}
    \fcolorbox{white}{blue!15!gray!15}{
      \begin{minipage}{0.95\linewidth}\vspace{10pt}
        \center
        \begin{minipage}{0.8\linewidth}{\space\Huge``}{#1}{\hspace{1.5em}\break\null\Huge\hfill''}
        \end{minipage}
        \smallbreak
      \end{minipage}
    }
\end{center}
}




%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{Cruciverb-IT at EVALITA 2026: Overview of the Crossword Solving in Italian Task}


%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1,2]{Cristiano Ciaccio}[%
orcid=0009-0001-6113-4761,
email=cristiano.ciaccio@ilc.cnr.it,
]
\address[1]{Department of Computer Science, University of Pisa, Italy}
\address[2]{Institute for Computational Linguistics ``A. Zampolli'' (CNR-ILC) - ItaliaNLP Lab, Pisa, Italy}

\author[3]{Gabriele Sarti}[%
orcid=0000-0001-8715-2987,
email=g.sarti@northeastern.edu,
]\fnmark[1]
\address[3]{Khoury College of Computer Sciences, Northeastern University, USA}


\author[2]{Alessio Miaschi}[%
orcid=0000-0002-0736-5411,
email=alessio.miaschi@ilc.cnr.it,
]

\author[2]{Felice Dell'Orletta}[%
orcid=0000-0003-3454-9387,
email=felice.dellorletta@ilc.cnr.it,
]

\author[4]{Malvina Nissim}[%
orcid=0000-0001-5289-0971,
email=m.nissim@rug.nl,
]

\address[4]{Center for Language and Cognition (CLCG), University of Groningen, The Netherlands}


\fntext[1]{Work done at the University of Groningen, The Netherlands.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
  Cruciverb-IT is the first shared task on Italian crossword solving, held at EVALITA 2026. The task comprises two subtasks: (1) answering individual crossword clues given the expected answer length, and (2) autonomously solving complete crossword grids of varying sizes. We release a dataset of approximately 410,000 Italian clue-answer pairs along with automatically generated crossword grids ranging in size from $5\times5$ to $13\times13$. Five teams participated in the evaluation, submitting a total of 17 system runs. The best-performing system on Subtask 1 achieved 69\% accuracy at rank 1 and an MRR of 0.72 using a retrieval-augmented LLM approach, while the top system on Subtask 2 reached an average character accuracy of 92\%, fully solving 34\% of grids by means of a fine-tuned encoder-decoder model paired with a constraint-driven depth-first search and ranking heuristics. Results show that while modern approaches achieve strong performance on individual clues and smaller grids, solving larger crosswords remains an open problem, with full-match performance decreasing rapidly for grids larger than $5\times5$.
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  NLP \sep
  Crossword Solving \sep
  Evaluation \sep
  Language Models \sep
  Italian \sep
  Shared Task
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Introduction and Background}
\label{sec:intro}

Historically, language games have been an important testbed for
creating and studying complex decision-making programs, largely due to a fundamental property: no fixed set of rules is sufficient to define the overall gameplay. Because natural language is involved, and the interpretation of meaning plays a crucial role, judgment is needed not only to produce a solution but even to interpret the rules themselves. Moreover, since natural language can be used to describe the full range of human experiences \cite{shapiro1992ai}, language games are inherently inconsistent with the closed-world assumption \cite{littman-2000}, according to which anything not explicitly defined is assumed not to hold.

Consequently, prior work has characterized language games such as crossword puzzles as AI-complete problems \cite{webcrow}, as solving them requires human-level knowledge and natural language understanding capabilities. For these reasons, language games are emerging as valuable testbeds for evaluating and enhancing the reasoning abilities of Language Models (LMs).

Among language games, crossword puzzles represent a particularly challenging and multifaceted task that requires not only linguistic competence but also cultural knowledge, lateral thinking, and the ability to interpret ambiguous or polysemous clues~\cite{english-crossword,cryptic-crosswords,saha-etal-2025-language,sadallah-etal-2025-makes}. As a result, solving crosswords involves complex semantic and pragmatic reasoning, making this setting ideal for testing models’ deeper language understanding capabilities beyond surface-level aspects.

Before the advent of modern LMs, most approaches to crossword solving and clue answering relied on retrieval-based methods and shallow lexical and semantic features~\cite{webcrow,italian-crossword-solver}. For example, \citet{barlacchi2014retrieval} proposed a system that exploited lexical resources and similarity metrics to match clues to candidate answers in Italian, while the SACRY system~\cite{moschitti-etal-2015-sacry} incorporated syntactic information and ranking strategies to improve clue-answer matching. However, these systems typically struggle with clues that require deeper interpretative reasoning, such as wordplay, anagrams, or polysemous expressions. Consider, for instance, the clue ``Producono con procedimenti lenti'', where ``lenti'' can mean both ``slow'' and ``lenses'' in Italian;\footnote{The phrase can be translated as ``[they] produce with slow procedures'', or ``[they] produce lenses with procedures'', with an unusual but acceptable constituent order in the latter.} a viable answer could be \textit{ottici} (opticians), illustrating the type of ambiguity traditional systems often fail to resolve.

Despite the impressive advancements in Large Language Models (LLMs), their performance on language games such as crosswords remains limited, especially in morphologically rich and less-resourced languages like Italian~\cite{sarti-etal-2024-non,sarti-etal-2024-eurekarebus,ciaccio2025crosswords}. Existing LMs and retrieval-based systems still fall short when faced with clues requiring subtle reasoning or cultural grounding.

Building on this line of research, the Cruciverb-IT task organized at EVALITA 2026~\cite{evalita2026overview} represents the first shared task specifically dedicated to crossword solving. The initiative was designed to encourage research in this direction by providing a challenging testbed for developing and evaluating systems on crossword puzzle solving.

\section{Definition of the Task}
\label{sec:task}

The Cruciverb-IT shared task is organized into two subtasks:

\paragraph{Subtask 1: Clue Answering.} The first subtask consists of answering clues extracted from Italian crosswords. The task is framed as a question-answering problem: participants are presented with a set of clues $C = \{c_1, c_2, \dots, c_n\}$ and are asked to build a system that, for a given clue $c_i$, produces one or more candidate solutions $\hat{S} = \{\hat{s}_1, \hat{s}_2, \dots, \hat{s}_m\}$, ideally including the correct answer $s_i$. To simulate a more realistic crossword-solving scenario and to further guide systems towards the correct answer space, each clue $c_i$ is paired with the character length of the target answer $s_i$. For example, given the clue and target length \textit{Sono un fiore di straordinaria bellezza, 4}, a system should produce a list of one or more candidates, e.g. \textit{[iris, \textbf{rosa}, rose, yuzu, fior, ...]}, which here contains the correct answer \textit{\textbf{rosa}}.
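As a worked illustration, under the metrics defined in Section~\ref{sec:evaluation}: in the candidate list above the gold answer \textit{rosa} appears at rank 2, so this clue would count as a miss for Accuracy@1, as a hit for Accuracy@10, and would contribute a reciprocal rank of
\[
  \frac{1}{\mathrm{rank}(\textit{rosa})} = \frac{1}{2} = 0.5
\]
to the MRR.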

\paragraph{Subtask 2: Grid Solving.} The second subtask consists of autonomously solving Italian crossword grids. Participants are presented with a set of empty crossword grids $G = \{\textbf{G}_1, \textbf{G}_2, \dots, \textbf{G}_k\}$, where each grid $\textbf{G}_i$ is paired with a list of clues, each annotated with the $(x, y)$ coordinates of the square where the corresponding solution starts and with its direction, either down (\textit{verticale}) or across (\textit{orizzontale}). Each grid $\textbf{G}_i$ is an $n \times n$ matrix whose squares are either blank or black. Systems should autonomously fill the grid with appropriate solutions, yielding a fully or partially filled crossword grid that ensures consistent overlap between the characters of crossing words and maximizes the number of correctly placed solutions.
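The overlap requirement can be stated in symbols; the notation below is ours and is introduced only for illustration. Let slot $k$ start at $(x_k, y_k)$ with length $\ell_k$, and let $a_k[t]$ denote the $t$-th character of the answer placed in it, for $0 \le t < \ell_k$. Assuming, as in the released data format described in Section~\ref{sec:dataset}, that $x$ indexes rows and $y$ indexes columns, the slot writes
\[
  \textbf{G}_i[x_k + t,\; y_k] = a_k[t] \quad \mbox{(down)},
  \qquad
  \textbf{G}_i[x_k,\; y_k + t] = a_k[t] \quad \mbox{(across)},
\]
so a filled grid is consistent exactly when every pair of crossing slots assigns the same character to the square they share.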

\subsection{Task Constraints}
\label{sec:task_constraints}

Participants were allowed to take part in either subtask or both. For both tasks, we enforced a set of constraints to ensure a fair comparison across systems and prevent training contamination. In particular, the use of external data sources that explicitly contain crossword clues or clue--answer pairs was strictly forbidden. All other types of external data and resources were permitted, including but not limited to dictionaries, encyclopedic resources (e.g., Wikipedia), lexical databases (e.g., WordNet), pre-trained or fine-tuned language models, and distributional representations. Participants were required to explicitly report all external data and resources used in developing their systems. This restriction was introduced to avoid trivial memorization effects and to prevent scenarios in which systems could exploit large collections of gold crossword data combined with highly engineered search strategies to achieve artificially high performance, potentially comparable to or even exceeding that of professional human solvers \cite{ginsberg2011dr}\footnote{\url{https://en.wikipedia.org/wiki/Dr.Fill}}. By disallowing crossword-specific external resources, we aimed to foster the development of models that genuinely address clue interpretation, lexical retrieval, and reasoning.

\section{Dataset}
\label{sec:dataset}

For the proposed task, we relied on both the~\textsc{ItaCW} crossword dataset~\cite{italian-crossword-generation} and a collection of additional clue-solution pairs found on the web. After duplicate removal, the final dataset contains approximately 410,000 clue-answer pairs covering various clue types, including wordplay, cryptic clues, named-entity initials, and fill-in-the-blank clues. For the first task, the dataset was divided into training (90\%), validation (5\%), and test (5\%) sets, resulting in approximately 370,000 training examples and 20,000 examples each for validation and testing. Splits were released as \textit{.csv} files containing three columns, \textit{clue}, \textit{answer}, and \textit{answer\_length}, with the \textit{answer} column omitted in the test set.
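As a rough check of these figures, using the rounded total of 410,000 pairs (so the results are themselves approximate):
\[
  0.90 \times 410{,}000 = 369{,}000 \approx 370{,}000,
  \qquad
  0.05 \times 410{,}000 = 20{,}500 \approx 20{,}000 .
\]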

For the second task, we automatically generated crossword grids using a constraint-driven, search-based construction algorithm that populates a predefined crossword layout with valid words drawn from the answers in the aforementioned train, validation, and test splits, respectively. Specifically, we first generated several empty square matrices by placing black squares randomly (in varying proportions of the total number of squares) while ensuring a symmetric layout; we then populated each grid with the aforementioned algorithm. Lastly, we collected the corresponding clue for each word in the grid, thereby obtaining several complete and plausible crosswords. We generated crosswords of different sizes to account for various levels of complexity: $5\times5$, $7\times7$, $9\times9$, $11\times11$, and $13\times13$, with black squares accounting for 15\%, 16\%, 22\%, 27\%, and 27\% of the squares, respectively.
%Moreover, to avoid sampling extremely rare and complex words, we limited the search only for those that are in the top 50\% of ItWac sorted by frequency.
Each empty crossword grid is represented as a matrix, i.e. a list of lists, where each square is either blank (denoted by a whitespace ' ') or black (denoted by a dot '.'). The clues associated with a grid are given as a list of dictionaries with keys \textit{answer}, \textit{clue}, \textit{x}, \textit{y}, \textit{direction}, and \textit{length}: the coordinates \textit{x} and \textit{y} (row and column, respectively) indicate the square where the solution starts, \textit{length} is the number of characters in the solution, and \textit{direction}, denoted by \textit{D} or \textit{A}, indicates whether the solution is placed down (\textit{D}) or across (\textit{A}). As for the first task, we divided the grids into training (500 grids), validation (50 grids), and test (50 grids) sets, following a predefined distribution over grid sizes. The training set consists of 300 grids of size $5\times5$, 150 of size $7\times7$, 25 of size $9\times9$, 15 of size $11\times11$, and 10 of size $13\times13$. Both the validation and test sets include 10 grids for each of the five grid sizes. Participants were asked to return the completed crossword grids produced by their systems, represented in the same matrix-based format as the input empty grids, with blank squares filled with the predicted answers\footnote{The dataset can be found at the following HuggingFace repository: \url{https://huggingface.co/datasets/cruciverb-it/evalita2026}}.%Empty grids will be represented as follows:
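
The training distribution is therefore heavily skewed towards small grids, whereas the validation and test sets are uniform over sizes. Computing the shares from the counts above gives
\[
  \frac{300}{500} = 60\%, \quad
  \frac{150}{500} = 30\%, \quad
  \frac{25}{500} = 5\%, \quad
  \frac{15}{500} = 3\%, \quad
  \frac{10}{500} = 2\%,
\]
against $10/50 = 20\%$ per size in the validation and test sets.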

\begin{comment}
\begin{verbatim}
[
    [' ', ' ', ' ', ' ', ' ', ' ', ' '],
    [' ', ' ', ' ', ' ', ' ', '.', '.'],
    [' ', ' ', ' ', ' ', '.', ' ', ' '],
    [' ', ' ', ' ', ' ', ' ', ' ', ' '],
    [' ', ' ', '.', ' ', ' ', ' ', ' '],
    ['.', '.', ' ', ' ', ' ', ' ', ' '],
    [' ', ' ', ' ', ' ', ' ', ' ', ' ']
]

\end{verbatim}

Clues will be formatted as follows (answers will be excluded in the test set):

\begin{verbatim}
    [('SPREZZO', 'Il temerario lo ha per la vita', 0, 0, 'A'), 
    ('AIOSA', 'In abbondanza', 1, 0, 'A'), 
    ('FEBO', 'Appellativo di Apollo, dio della bellezza', 2, 0, 'A'), 
    ('JE', 'Le prime di Jefferson', 2, 5, 'A'), 
    ('AVANCES', 'I tentativi del "dragueur"', 3, 0, 'A'), 
    ('DE', 'Il Profundis inizio di un Salmo', 4, 0, 'A'), 
    ('ELAT', "Porto d'Israele", 4, 3, 'A'), 
    ('TRENI', 'Viaggiano su rotaie', 5, 2, 'A'), 
    ('CANOSSA', "Il castello d'una storica Matilde", 6, 0, 'A'), 
    ('SAFAD', "Una traslitterazione di una delle
    quattro città sante dell'Ebraismo", 0, 0, 'D'), 
    ('PIEVE', 'Parrocchia di campagna', 0, 1, 'D'), 
    ('ROBA', 'Quella da matti è frequente!', 0, 2, 'D'), 
    ('TN', 'Tazio, asso del volante (iniz.)', 5, 2, 'D'), 
    ('ESONERO', "Può subirlo anche l'allenatore", 0, 3, 'D'), 
    ('ZA', 'La fine della tolleranza', 0, 4, 'D'), 
    ('CLES', 'Località della Val di Non', 3, 4, 'D'), 
    ('JEANS', 'Si fanno con il denim', 2, 5, 'D'), 
    ('ESTIA', 'La Vesta degli antichi Greci', 2, 6, 'D')]
\end{verbatim}

Lastly, the filled grid should be formatted as follows:

\begin{verbatim}
[
    ['S', 'P', 'R', 'E', 'Z', 'Z', 'O'],
    ['A', 'I', 'O', 'S', 'A', '.', '.'],
    ['F', 'E', 'B', 'O', '.', 'J', 'E'],
    ['A', 'V', 'A', 'N', 'C', 'E', 'S'],
    ['D', 'E', '.', 'E', 'L', 'A', 'T'],
    ['.', '.', 'T', 'R', 'E', 'N', 'I'],
    ['C', 'A', 'N', 'O', 'S', 'S', 'A']
]

\end{verbatim}
\end{comment}

\section{Evaluation}
\label{sec:evaluation}

The evaluation of the systems was conducted with specific metrics per task, as follows:

\begin{itemize}
    \item Task-1: \textbf{Accuracy@1/10}, that is the accuracy in retrieving the correct solution word given the corresponding clue, considering the top 1 and 10 candidates produced by the system; \textbf{Mean Reciprocal Rank (MRR)}, that is the average of the reciprocal ranks of the first relevant item across all clues.
    \item Task-2: \textbf{\% of correct characters (CharAcc, CA)}, that is the accuracy in inserting the correct characters in the correct slots; \textbf{\% of correct words (WordAcc, WA)}, accuracy in inserting the correct word in the correct slots; \textbf{\% of grids solved correctly (FullMatch, FM)}, the accuracy in solving the entire grid. Partially filled grids were evaluated by counting empty squares as errors. 
\end{itemize}
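For concreteness, the Task-1 metrics can be written as follows. The notation is ours, and we assume the common convention that a clue whose gold answer is absent from the candidate list contributes a reciprocal rank of zero. Let $r_i$ be the rank of the gold answer $s_i$ in the candidate list returned for clue $c_i$ (with $r_i = \infty$, and hence $1/r_i = 0$, if it is missing), and let $n$ be the number of test clues:
\[
  \mathrm{Acc@}k = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[\,r_i \le k\,],
  \qquad
  \mathrm{MRR} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{r_i}.
\]
Analogously, for a single grid with $W$ non-black squares and $K$ answer slots, a sketch of the Task-2 metrics is
\[
  \mathrm{CA} = \frac{\#\{\mbox{squares filled with the gold character}\}}{W},
  \qquad
  \mathrm{WA} = \frac{\#\{\mbox{slots filled with the gold answer}\}}{K},
\]
with $\mathrm{FM} = 1$ if every non-black square is correct (i.e. $\mathrm{CA} = 1$) and $\mathrm{FM} = 0$ otherwise. An empty square never matches the gold character, so it lowers CA and the WA of every slot crossing it, in line with the treatment of partially filled grids stated above.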

\paragraph{Baselines} For clue answering, our baseline approaches the task as an information retrieval problem: given a clue $c_i$ from the test set $C_{test} = \{c_1, \dots, c_n\}$, the system computes a similarity score between $c_i$ and each clue in the training set $C_{train} = \{c'_1, \dots, c'_m\}$ and ranks the training clues accordingly. After selecting the ten most similar clues, we extract the corresponding ten answers. The similarity scores between clues are estimated using the BM25 algorithm \cite{robertson2009probabilistic}, a well-established ranking function in Information Retrieval. For solving crossword grids, our baseline combines the aforementioned ranker with an additional module that searches for a solution maximizing the satisfied soft constraints while respecting the grid's hard constraints. Specifically, by treating crossword puzzles as a weighted Max-SMT problem, as partially described in \cite{kulshreshtha-etal-2022-across}, the baseline defines hard constraints (the grid structure) and soft constraints (candidate ranking preferences). Each clue corresponds to a disjunctive group of grid variables constrained to match candidate answers, and these groups are combined conjunctively across the grid. The formulation uses the Z3 optimizer \cite{de2008z3}\footnote{We modified an open-source implementation: \url{https://github.com/pncnmnp/Crossword-Solver}.} with 10 candidates per clue.
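For reference, the standard BM25 score of a training clue $d$ for a query clue $q$ is
\[
  \mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \bigl(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\bigr)},
\]
where $f(t, d)$ is the frequency of term $t$ in $d$, $|d|$ is the length of $d$ in tokens, $\mathrm{avgdl}$ is the average clue length in the training set, and $k_1$ and $b$ are free parameters whose values are not specified here. Similarly, a weighted objective of the kind used by the grid-solving baseline can be sketched as follows; the symbols are introduced only for illustration and do not reproduce the exact encoding. Let $z_{k,j} \in \{0, 1\}$ indicate that slot $k$ is filled with its $j$-th ranked candidate, and let $w_{k,j} > 0$ be a weight that decreases with the rank $j$. The solver then seeks
\[
  \max_{z} \; \sum_{k} \sum_{j=1}^{10} w_{k,j}\, z_{k,j}
  \quad \mbox{subject to} \quad
  \sum_{j=1}^{10} z_{k,j} \le 1 \;\; \mbox{for all } k,
\]
together with the hard requirement that any two selected candidates agree on the character of every square they share.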


%For the task of solving crossword grids, our baseline is computed by leveraging the aforementioned ranker baseline combined with an additional module that optimizes for a solution by maximizing satisfied constraints while respecting the grid’s hard constraints. By treating crossword puzzles as a weighted Max-SMT problem, as partially described in \cite{kulshreshtha-etal-2022-across}, we defined a set of hard and soft logical constraints over the grid variables (squares): each clue corresponds to a sequence of grid variables constrained to match one of its candidate answers, obtained through the task-1 baseline, forming a disjunctive (OR) group. These candidate-level constraints are then combined conjunctively (AND) across all clues. Intersections are enforced implicitly by shared cell variables (i.e., crossing words write into the same cell), ensuring character consistency between overlapping horizontal and vertical words. Each candidate is paired with a corresponding ranking weight\footnote{Given a list of candidates $\hat{S} = \{\hat{s}_i, \hat{s}_{i+1}, \dots, \hat{i}_n\}$, the weight score for a candidate $s_i$ is simply $|\hat{S}| - i$.} in order to consider candidate importance as a soft preference during maximization. The final formulation is passed to the Z3 optimizer\footnote{We modify an open-source implementation: \url{https://github.com/pncnmnp/Crossword-Solver} \cite{de2008z3}.}, which satisfies all hard constraints and maximizes the weighted satisfaction of soft constraints. Importantly, our baseline approach can yield partially filled grids. We simply run the solver with a candidate size of 10 per clue.

\section{Submitted Systems and Participants}
\label{sec:systems}


Following a call for interest, five teams registered for the task and submitted their predictions, for a total of 17 runs (11 for subtask~1 and 6 for subtask~2). As shown in Table~\ref{tab:submitted_systems}, three teams participated only in subtask~1, while two submitted runs for both subtasks.

\paragraph{AC/DG} \cite{ac-dg} AC/DG adopts a retrieval-based framework that combines lexical, semantic, and hybrid reranking strategies. Given a clue and a target length, all retrieval methods operate under strict length constraints, restricting the search space to training instances whose solutions match the target length. The first component is a sparse lexical retriever based on BM25, designed to capture explicit term overlap and definitional clues. In parallel, a dense retrieval model, fine-tuned on Italian and based on a Sentence-BERT encoder, maps clues into a latent semantic space, allowing the system to retrieve morphologically and semantically related candidates even when lexical overlap is limited. The two retrieval streams are combined in a hybrid retrieve-and-rerank strategy, where the top candidates from both BM25 and dense retrieval are passed to an LLM (Qwen3 8B \cite{yang2025qwen3}) acting as a zero-shot judge. This model evaluates, reorders, and, when necessary, augments the candidate set by generating a fallback solution. The final output is obtained by selecting the highest-ranked answer from this reranked list, thereby balancing precision from lexical matching with semantic generalization and generative reasoning.

\paragraph{FFT-UniBa} \cite{fft-uniba} FFT-UniBa adopts a two-stage approach corresponding to the two subtasks. For Task~1, the authors fine-tune an encoder-decoder model based on IT5 \cite{sarti-nissim-2024-it5}, pre-trained on Italian texts, introducing length-aware special tokens to explicitly control answer generation. Each input is augmented with a pair of tokens marking the expected solution length, including a length-dependent end-of-sequence token. This design encourages the model to internalize length constraints during training, reducing generation errors caused by length mismatches without relying on post-hoc filtering. For Task~2, each crossword is formulated as a constraint satisfaction problem, where variables correspond to clue slots, and domains consist of ranked candidate answers generated by the Task~1 model. When necessary, the candidate sets are augmented with a small number of dictionary-based words matching the required length and letter pattern. Grid constraints enforce character consistency at word intersections. Crossword solving is performed via a depth-first backtracking search, initialized from single-word seed configurations and guided by model perplexity scores. To ensure tractable inference, the system enforces explicit limits on node expansions and dynamically adapts candidate cutoffs and search budgets based on grid size.
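For reference, the standard definition of perplexity for a candidate answer $y = (y_1, \dots, y_T)$ generated from clue $c$ is given below. The notation is ours, and the way these scores enter the search ordering is detailed in the team's report rather than here:
\[
  \mathrm{PPL}(y \mid c) = \exp\Bigl(-\frac{1}{T}\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, c)\Bigr),
\]
so that a lower perplexity corresponds to a candidate the model considers more likely.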


\paragraph{MINDS} \cite{minds} MINDS frames crossword clue answering as a masked language modeling problem using an encoder-only architecture. Given a clue and the target word length in characters, the input is constructed by appending a templated sequence in which the answer is represented by a span of \texttt{[MASK]} tokens, together with an explicit indication of the expected length. An Italian BERT model is fine-tuned to reconstruct the masked span from the clue context, using standard masked language modeling with cross-entropy loss applied only to the answer positions. At inference time, since the number of subword tokens corresponding to the answer is unknown, the system queries the model with multiple hypothesized mask lengths. For each length, top-\(K\) predictions are extracted for each masked position and combined to form candidate answers, which are scored using the geometric mean of token probabilities. Candidates generated across different mask lengths are merged into a single ranked list, keeping the highest score for duplicates. Invalid candidates are pruned based on character length and symbol constraints, and the final output consists of the top-ranked valid answers. This strategy allows an encoder-only model to approximate variable-length generative behavior for crossword solving.
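One way to write the scoring step described above (the notation is ours, and the exact implementation is detailed in the system description) is as follows. For a candidate assembled from $T$ masked positions with predicted token probabilities $p_1, \dots, p_T$,
\[
  \mathrm{score} = \Bigl(\prod_{t=1}^{T} p_t\Bigr)^{1/T}
  = \exp\Bigl(\frac{1}{T}\sum_{t=1}^{T} \log p_t\Bigr).
\]
Unlike the raw product, the geometric mean does not systematically penalize candidates that span more subword tokens, which is presumably what makes scores comparable across the different hypothesized mask lengths.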

\begin{table*}[t!]
    \centering
    %\scriptsize
    \resizebox{\textwidth}{!}{
    \begin{tabular}{lclccc}
    \hline
    \textbf{Team} & \textbf{Members} & \textbf{Affiliation} & \textbf{Task} & \textbf{Runs T1} & \textbf{Runs T2} \\
    \hline
    AC/DG & 4 & Politecnico di Torino & 1 & 3 & - \\
      FFT-UniBa   &  5 & Università degli Studi di Bari Aldo Moro & 1,2 & 4 & 4 \\
      MINDS & 1 & Politecnico di Torino & 1 & 1 & -\\
      UNIBA & 1 & Università degli Studi di Bari Aldo Moro & 1,2 & 2 & 2\\
      UniTor & 2 & Reveal Srl; Università degli Studi di Roma Tor Vergata & 1 & 1 & -\\
      \hline
    \end{tabular}
    }
    \caption{Teams participating in the EVALITA 2026 Cruciverb-IT shared task. For each team, we detail the number of team members, their affiliations, the sub-task(s) they participated in, and the number of submitted runs per subtask (T1 and T2).}
    \label{tab:submitted_systems}
\end{table*}


\paragraph{UNIBA} \cite{uniba} UNIBA addresses the crossword-solving task by mimicking a human-like incremental solving strategy based on partial solutions and cross-checking. The approach relies on an encoder--decoder transformer trained to generate answers conditioned not only on the clue, but also on a partially filled solution, when available. Training data are expanded by masking one or more characters in each gold answer, generating all possible partial solutions, and enriching the input with the number of missing characters and the expected answer length. The model is based on IT5-Large and is trained exclusively on the task-provided data, without external lexical resources. During inference, candidate answers are generated dynamically at each step based on the current grid state. The crossword filling task is instead performed using a beam search strategy that iteratively selects and expands the most promising partial grids. To improve candidate selection, the system employs a binary classifier trained on simulated crossword-solving trajectories, which scores candidate answers using grid-level, clue-level, and generation-based features. Candidates are ranked by classifier confidence and decoder scores, allowing the system to balance solution quality and search efficiency. The final submission excludes models using special tokens, which showed inferior validation performance.


\paragraph{UniTor} \cite{unitor} UniTor proposes a retrieval-grounded, LLM-based system that formulates crossword clue answering as a constrained ranking problem. Given a clue and target length, the system combines retrieval-augmented evidence with structured LLM prompting to generate and rank candidate solutions. In a first stage, UniTor retrieves length-compatible clue–solution pairs from a large indexed repository using lexical (i.e. BM25), neural (i.e. BGE-M3 \cite{multi2024m3}), or hybrid similarity, and injects them into the prompt as lightweight few-shot evidence. In a second stage, candidate generation and ranking are performed within a single LLM call through a structured two-phase prompt. The model is first instructed to explore a diverse set of plausible candidates, prioritizing recall and semantic coverage, and then to filter, normalize, and re-rank them under hard structural constraints such as exact length. This explicit separation between exploration and selection is designed to improve ranking stability and constraint adherence while avoiding multiple LLM interactions. The system outputs a probability-ranked list of candidate answers, where scores represent relative confidence rather than calibrated probabilities. UniTor is evaluated across multiple instruction-tuned LLMs of varying scale to analyze the contributions of model capacity, retrieval grounding, and structured reasoning strategies to crossword-solving performance. The final submitted run is based on the GLM 4.6 model \cite{zeng2025glm}.

\begin{comment}
\begin{table}[t!]
\centering
\begin{tabular}{lrrr}
\hline
\textbf{Team}                         & \textbf{Char Acc.} & \textbf{Word Acc.} & \textbf{Full Match} \\
\hline
FFT-UniBa\_c1000\_1       & 0.92      & 0.85      & 0.34       \\
FFT-UniBa\_c1000NODICT\_1 & 0.92      & 0.85      & 0.32       \\
FFT-UniBa\_c100\_1        & 0.93      & 0.86      & 0.32       \\
FFT-UniBa\_c100\_2        & 0.93      & 0.85      & 0.30       \\
FFT-UniBa\_c100NODICT\_1  & 0.92      & 0.84      & 0.28       \\
FFT-UniBa\_c1000\_2       & 0.93      & 0.84      & 0.28       \\
FFT-UniBa\_c10\_1         & 0.92      & 0.84      & 0.26       \\
FFT-UniBa\_c100NODICT\_2  & 0.91      & 0.82      & 0.24       \\
FFT-UniBa\_c1000NODICT\_2 & 0.91      & 0.82      & 0.22       \\
FFT-UniBa\_c10\_2         & 0.91      & 0.82      & 0.22       \\
FFT-UniBa\_c10NODICT\_1   & 0.90      & 0.80      & 0.20       \\
FFT-UniBa\_c10NODICT\_2   & 0.89      & 0.79      & 0.18       \\
UNIBA\_Run1                  & 0.82      & 0.66      & 0.16       \\
UNIBA\_Run2                  & 0.82      & 0.67      & 0.16       \\
Baseline                     & 0.73      & 0.58      & 0.08      \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 2 leaderboard.}
\label{tab:task2_results}
\end{table}
\end{comment}

\begin{comment}
\begin{table*}[t!]
\centering
\tiny
\begin{tabular}{l|lll|lll|lll|lll|lll|lll}
\hline
                          & \multicolumn{3}{|c|}{\textbf{Overall}}                      & \multicolumn{3}{c|}{\textbf{5x5}}            & \multicolumn{3}{c|}{\textbf{7x7}}            & \multicolumn{3}{c|}{\textbf{9x9}}            & \multicolumn{3}{c|}{\textbf{11x11}}          & \multicolumn{3}{c}{\textbf{13x13}}          \\
                          \hline
\textbf{Team}                      & \textbf{CA}       & \textbf{WA}         & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\hline
FFT-UniBa\_c1000\_1       & 0.92            & 0.85              & 0.34       & 1.00      & 1.00      & 1.00       & 0.94      & 0.88      & 0.60       & 0.92      & 0.83      & 0.10       & 0.91      & 0.82      & 0.00       & 0.85      & 0.73      & 0.00       \\
FFT-UniBa\_c1000NODICT\_1 & 0.92            & 0.85              & 0.32       & 1.00      & 0.98      & 0.90       & 0.95      & 0.90      & 0.60       & 0.92      & 0.85      & 0.10       & 0.91      & 0.81      & 0.00       & 0.84      & 0.73      & 0.00       \\
FFT-UniBa\_c1000\_2       & 0.93            & 0.84              & 0.28       & 0.98      & 0.94      & 0.70       & 0.94      & 0.86      & 0.50       & 0.94      & 0.86      & 0.20       & 0.89      & 0.79      & 0.00       & 0.87      & 0.76      & 0.00       \\
FFT-UniBa\_c1000NODICT\_2 & 0.91            & 0.82              & 0.22       & 0.97      & 0.90      & 0.70       & 0.91      & 0.82      & 0.30       & 0.93      & 0.84      & 0.10       & 0.91      & 0.81      & 0.00       & 0.85      & 0.74      & 0.00       \\
UNIBA\_Run1               & 0.82            & 0.66              & 0.16       & 0.86      & 0.72      & 0.40       & 0.86      & 0.72      & 0.30       & 0.85      & 0.69      & 0.10       & 0.80      & 0.65      & 0.00       & 0.71      & 0.52      & 0.00       \\
UNIBA\_Run2               & 0.82            & 0.67              & 0.16       & 0.89      & 0.77      & 0.40       & 0.85      & 0.72      & 0.30       & 0.83      & 0.68      & 0.10       & 0.80      & 0.65      & 0.00       & 0.70      & 0.52      & 0.00       \\
Baseline                  & 0.73 & 0.58 & 0.08       &  0.85         &   0.71        &     0.40       &     0.68      &   0.49        &    0.00        &     0.74      &   0.62        &    0.00        &     0.66      &      0.51     &     0.00       &     0.73      &     0.59      &    0.00   \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 2 leaderboard. Systems are ranked according to their overall performance (FM), with results reported both globally (across all crossword grids) and separately for each grid size (5×5, 7×7, 9×9, 11×11, 13×13).}
\label{tab:task2_results_all}
\end{table*}
\end{comment}

\section{Results and Discussion}
\label{sec:discussion}

In the following, we report and discuss the results achieved by the participants. We complement them with a quantitative analysis of the factors that influence performance in both subtasks, of the similarity between systems' predictions, and of the potential of an ensemble that leverages all participants' predictions.

\begin{table}[t]
\centering
\scriptsize

\begin{tabular}{c c}

% -------- SUBTASK 1 --------
\begin{tabular}{lccc}
\toprule
\multicolumn{4}{c}{\textbf{Subtask 1}} \\
\midrule
\textbf{Systems} & \textbf{Acc@1} & \textbf{Acc@10} & \textbf{MRR} \\
\midrule
UniTor                    & 0.69 & 0.83 & 0.72 \\
FFT-UniBa\_Constrained1   & 0.58 & 0.75 & 0.63 \\
MINDS                     & 0.59 & 0.71 & 0.62 \\
FFT-UniBa\_Constrained2   & 0.57 & 0.75 & 0.62 \\
FFT-UniBa\_Unconstrained1 & 0.55 & 0.72 & 0.60 \\
FFT-UniBa\_Unconstrained2 & 0.54 & 0.73 & 0.60 \\
AC/DG\_Embeddings         & 0.51 & 0.73 & 0.57 \\
AC/DG\_BM25               & 0.47 & 0.67 & 0.53 \\
AC/DG\_Hybrid\_Qwen3\_Judge & 0.46 & 0.69 & 0.52 \\
UNIBA RUN2                & 0.43 & 0.59 & 0.47 \\
Baseline                  & 0.40 & 0.62 & 0.46 \\
UNIBA RUN1                & 0.36 & 0.54 & 0.41 \\
\bottomrule \\[1pt]
\multicolumn{4}{c}{\textbf{(a)}}
\end{tabular}
\hspace{0.99em} &

% -------- SUBTASK 2 --------
\begin{tabular}{lccc}
\toprule
\multicolumn{4}{c}{\textbf{Subtask 2}} \\
\midrule
\textbf{Systems} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\midrule
FFT-UniBa\_c1000\_2       & 0.92 & 0.85 & 0.34 \\[7.9pt]
FFT-UniBa\_c1000NODICT\_2 & 0.92 & 0.85 & 0.32 \\[7.9pt]
FFT-UniBa\_c1000\_1       & 0.93 & 0.84 & 0.28 \\[7.9pt]
FFT-UniBa\_c1000NODICT\_1 & 0.91 & 0.82 & 0.22 \\[7.9pt]
UNIBA RUN2                & 0.82 & 0.67 & 0.16 \\[7.9pt]
UNIBA RUN1                & 0.82 & 0.66 & 0.16 \\[7.9pt]
Baseline                  & 0.73 & 0.58 & 0.08 \\
\bottomrule \\[1pt]
\multicolumn{4}{c}{\textbf{(b)}}
\end{tabular}

\end{tabular}

\caption{Cruciverb-IT leaderboard. Subtask~1 (\textbf{a}) ranked by MRR; Subtask~2 (\textbf{b}) according to the three metrics: CharAcc (CA), WordAcc (WA) and FullMatch (FM). Subtask~2 results are ranked by FM.}
\label{tab:cruciverb_joint}
\end{table}

% Results for the Cruciverb-IT subtasks highlight the current limitations of NLP systems in crossword solving.

\subsection{Subtask 1: Clue Answering} Table~\ref{tab:cruciverb_joint} (\textbf{a}) reports the leaderboard of systems participating in Subtask~1. First of all, we observe that most systems outperformed the baseline, indicating that the task cannot be reduced to a simple retrieval problem and that neural and hybrid approaches can capture deeper semantic and inferential patterns than a traditional BM25-based matching strategy. The retrieval-augmented LLM approach adopted by UniTor achieved the best performance, outperforming fine-tuned encoder-decoder models by a wide margin (+11 percentage points in Acc@1 over the second-best system), while leveraging a backbone LLM that is, parameter-wise, approximately 6000 times larger than the one used by FFT-UNIBA. This suggests that, while grounding generation with retrieved items provides valuable contextual guidance for LLMs, the number of parameters remains a strong predictor of final performance, even for crossword clue answering. Nevertheless, the Acc@10 gap between UniTor and FFT-UNIBA, MINDS, and AC/DG (ranging from 8 to 12 percentage points) is small relative to the large gap in the number of free parameters. This showcases the standalone strength of traditional clue-retrieval approaches and of small neural language models trained with a task-specific fine-tuning strategy, especially in terms of the efficiency/accuracy trade-off. To further support this, we replicate the asymmetrical dual-encoder approach proposed by \citet{ciaccio2025crosswords} on the official Subtask~1 data, yielding an Acc@10 of 78\%, which narrows the gap with the best system to 5 percentage points despite, again, a considerably smaller parameter count\footnote{We used paraphrase-multilingual-mpnet-base-v2 in a dual-encoder setup with $\approx 556$M free parameters.}.


\begin{comment}
\begin{table}[t!]
\centering
\scriptsize
\begin{minipage}[t]{0.48\textwidth}
\centering
\begin{tabular}{lrrr}
\toprule
\textbf{Team} & \textbf{Acc@1} & \textbf{Acc@10} & \textbf{MRR} \\
\midrule
UniTor                    & 0.69 & 0.83 & 0.72 \\
FFT-UniBa\_Constrained1   & 0.58 & 0.75 & 0.63 \\
MINDS                     & 0.59 & 0.71 & 0.62 \\
FFT-UniBa\_Constrained2   & 0.57 & 0.75 & 0.62 \\
FFT-UniBa\_Unconstrained1 & 0.55 & 0.72 & 0.60 \\
FFT-UniBa\_Unconstrained2 & 0.54 & 0.73 & 0.60 \\
AC/DG\_Embeddings         & 0.51 & 0.73 & 0.57 \\
AC/DG\_BM25               & 0.47 & 0.67 & 0.53 \\
AC/DG\_Hybrid\_Qwen3\_Judge & 0.46 & 0.69 & 0.52 \\
UNIBA RUN2                & 0.43 & 0.59 & 0.47 \\
Baseline                  & 0.40 & 0.62 & 0.46 \\
UNIBA RUN1                & 0.36 & 0.54 & 0.41 \\
\bottomrule
\end{tabular}
\caption{Cruciverb-IT Subtask 1 leaderboard.}
\label{tab:task1_results}
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\textwidth}
\centering
\begin{tabular}{l|ccc}
\toprule
\textbf{Team} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\midrule
FFT-UniBa\_c1000\_1       & 0.92 & 0.85 & 0.34 \\
FFT-UniBa\_c1000NODICT\_1 & 0.92 & 0.85 & 0.32 \\
FFT-UniBa\_c1000\_2       & 0.93 & 0.84 & 0.28 \\
FFT-UniBa\_c1000NODICT\_2 & 0.91 & 0.82 & 0.22 \\
UNIBA RUN2                & 0.82 & 0.67 & 0.16 \\
UNIBA RUN1                & 0.82 & 0.66 & 0.16 \\
Baseline                  & 0.73 & 0.58 & 0.08 \\
\bottomrule
\end{tabular}
\caption{Cruciverb-IT Subtask 2 leaderboard.}
\label{tab:task2_overall}
\end{minipage}
\end{table}
\end{comment}

\begin{comment}
\begin{table}
\centering
\scriptsize
\begin{tabular}{lrrr}
\hline
\textbf{Team}                      & \textbf{Acc@1} & \textbf{Acc@10} & \textbf{MRR}  \\
\hline
UniTor                    & 0.69  & 0.83   & 0.72 \\
FFT-UniBa\_Constrained1   & 0.58  & 0.75   & 0.63 \\
MINDS                     & 0.59  & 0.71   & 0.62 \\
FFT-UniBa\_Constrained2   & 0.57  & 0.75   & 0.62 \\
FFT-UniBa\_Unconstrained1 & 0.55  & 0.72   & 0.60 \\
FFT-UniBa\_Unconstrained2 & 0.54  & 0.73   & 0.60 \\
AC/DG\_Embeddings         & 0.51  & 0.73   & 0.57 \\
AC/DG\_BM25               & 0.47  & 0.67   & 0.53 \\
AC/DG\_Hybrid\_Qwen3\_Judge          & 0.46  & 0.69   & 0.52 \\
UNIBA RUN2              & 0.43  & 0.59   & 0.47 \\
%FFT-UniBa\_1              & 0.40  & 0.63   & 0.46 \\
Baseline                  & 0.40  & 0.62   & 0.46 \\
UNIBA RUN1             & 0.36  & 0.54   & 0.41 \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 1 leaderboard. Scores are ranked according to MRR.}
    \label{tab:task1_results}
\end{table}
\end{comment}

%Interestingly, the gap between Acc@1 and Acc@10 (0.69 to 0.83 for UniTor) indicates that correct answers are often present in the candidate lists but not always ranked first, suggesting room for improvement in re-ranking strategies. The constrained variants of FFT-UniBa, which incorporate length-aware special tokens, consistently outperform their unconstrained counterparts, suggesting that explicitly modeling answer length constraints during training can be more effective than relying on post-hoc filtering.

\begin{figure}[t!]
    \centering
    \includegraphics[width=1\linewidth]{images/task_1_corr.png}
    \caption{Average pairwise Jaccard similarity (top-10) between the prediction sets of all systems (a value of 1 indicates perfect overlap).}
    \label{fig:task_1_corr}
\end{figure}

\paragraph{System Similarity and Oracle Ensemble.} To assess potential similarities between systems, we computed the average pairwise Jaccard similarity across all systems' prediction sets. Specifically, given a test instance $t_i$ and two systems $f_o$ and $f_p$ producing the candidate lists $\hat{S}_o$ and $\hat{S}_p$, the Jaccard similarity between $\hat{S}_o$ and $\hat{S}_p$ is $J_{t_i} = \frac{|\hat{S}_o \cap \hat{S}_p|}{|\hat{S}_o \cup \hat{S}_p|}$. We compared all possible pairs of systems and averaged these values across all clues in the test set. As shown in Figure~\ref{fig:task_1_corr}, runs from the same team tend to cluster together and exhibit strong similarity, while there is almost no overlap across different teams. These results reveal that participants leveraged different approaches, yielding heterogeneous candidate lists. To further assess the impact of prediction diversity between systems, we built an oracle ensemble by taking the union of all systems' top-$k$ candidate sets ($k=1$ and $k=10$) for each clue, and counting a prediction as correct if any system included the gold answer. This approach resulted in upper-bound Acc@1 and Acc@10 of 85\% and 94\%, respectively. The marked improvements over the best system's scores (+16 percentage points in Acc@1, +11 in Acc@10) highlight the diversity between systems' predictions and the synergistic potential of combining the approaches proposed by the participants.
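
The Jaccard averaging and the oracle ensemble described above can be sketched more explicitly as follows. The formulation is only one plausible reading of the procedure: the superscript $(i)$ indexing test instances, the number of test clues $N$, the number of submitted runs $M$, the gold answer $a_i$, the top-$k$ set $\hat{S}_m^{(i,k)}$ of run $m$, and the indicator $\mathbf{1}[\cdot]$ are notational assumptions introduced here for illustration.
\[
\bar{J}(f_o, f_p) = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{S}_o^{(i)} \cap \hat{S}_p^{(i)}|}{|\hat{S}_o^{(i)} \cup \hat{S}_p^{(i)}|}
\]
\[
\mathrm{Acc@}k_{\mathrm{oracle}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\Big[\, a_i \in \bigcup_{m=1}^{M} \hat{S}_m^{(i,k)} \Big]
\]
Under this reading, the oracle score cannot fall below that of the best single run, because each run's top-$k$ set is contained in the union. The reported gains then correspond to $0.85 - 0.69 = 0.16$ for Acc@1 and $0.94 - 0.83 = 0.11$ for Acc@10.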

\paragraph{Influencing Factors.} Several factors were found to influence the systems' predictions\footnote{All reported correlations are Pearson coefficients with $p < 0.05$.}. Specifically, accuracy clearly tends to decrease as the number of characters in the answer increases, suggesting that longer answers are harder to predict than shorter ones (see Figure~\ref{fig:task_1_freqs-lens}, on the right), despite an initial drop for answers of length 2 (a set that usually includes wordplay, abbreviations, initials, etc.). Interestingly, while all systems follow the same trend, UNIBA RUN2 is more resilient to this effect, achieving competitive results for answers longer than 7 characters. Consistently, by inspecting the impact of answer frequency\footnote{Frequencies are computed on a 2021 Italian Wikipedia dump.}, we found a strong positive correlation between frequency and Acc@10 across all systems (see Figure~\ref{fig:task_1_freqs-lens}, on the left). Moreover, we also report a negative correlation between Acc@10 and clue length across all systems, ranging from $-0.53$ to $-0.9$, which probably indicates that longer clues are harder to interpret. The notable exception is UniTor, which shows no statistically significant correlation.
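
As a reference for how such coefficients are obtained, the sample Pearson correlation between a factor $x$ (e.g., answer length, clue length, or log frequency) and accuracy $y$ is recalled below. We assume, for illustration only, that the coefficient is computed over $B$ bins, with $x_b$ the value of the factor in bin $b$ and $y_b$ the Acc@10 measured on the instances falling in that bin; $B$, $x_b$, $y_b$ and the sample means $\bar{x}$, $\bar{y}$ are symbols introduced for this sketch.
\[
r = \frac{\sum_{b=1}^{B} (x_b - \bar{x})(y_b - \bar{y})}{\sqrt{\sum_{b=1}^{B} (x_b - \bar{x})^2}\,\sqrt{\sum_{b=1}^{B} (y_b - \bar{y})^2}}
\]
Under this reading, a value close to $-1$ means that accuracy falls almost linearly as the factor grows, whereas a value close to $0$ indicates no linear relationship between the two.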

\paragraph{System Agreement and Lexical Exposure.} We further analyzed the systems' errors by inspecting the degree of agreement across participants. For each clue--answer instance in the test set, we computed the percentage of systems that correctly predicted the gold answer. We then investigated how this agreement relates to lexical exposure by distinguishing whether or not the gold answer appeared in the training data\footnote{The distribution of agreement levels with respect to training set coverage is reported in Appendix~\ref{app:influencing-factors}.}. Our analysis reveals that lexical exposure has a pronounced effect primarily at higher agreement levels. Instances for which all systems correctly predicted the answer almost exclusively involve words that were observed during training (6236 instances, against only 2 whose answer was unseen). In contrast, for instances that were mispredicted by (almost) all systems, the distribution between seen and unseen words is nearly balanced. These results suggest that, given the heavy dependence of the proposed approaches on the training set, the presence of an answer word in the training data has a substantial impact on a system's ability to consistently retrieve it at test time, leading to strong agreement across participants. Conversely, in instances characterized by widespread errors, the presence or absence of the answer in the training set does not provide a clear advantage, indicating that lexical exposure alone is insufficient to overcome more challenging clues.
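
The agreement level used in this analysis can be written compactly as follows. The notation is an assumption made for illustration: $M$ denotes the number of submitted runs and $c_{m,i} \in \{0, 1\}$ indicates whether the prediction of run $m$ for instance $t_i$ counts as correct.
\[
A(t_i) = \frac{1}{M} \sum_{m=1}^{M} c_{m,i}
\]
With this definition, $A(t_i) = 1$ identifies the instances solved by every run and $A(t_i) = 0$ those missed by every run. For the fully agreed subset, the share of unseen answers is $2 / (6236 + 2) \approx 0.0003$, i.e., about $0.03\%$.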

\paragraph{Qualitative Analysis of Shared Errors.} Finally, we conducted a qualitative analysis of the errors shared by all systems, focusing on the subset of test instances that were consistently answered incorrectly and therefore represent the most challenging cases. To this end, we applied a TF-IDF representation to the clues and performed unsupervised K-Means clustering to identify recurring patterns in these hard instances. Inspecting the results from a qualitative standpoint\footnote{An excerpt of the identified clusters, along with clue examples and the top terms extracted with TF-IDF, is reported in Appendix~\ref{app:influencing-factors}.}, we noticed the presence of two clear macro-categories. A first group comprises clues that require access to specific cultural or encyclopedic knowledge (e.g., references to well-known public figures, films, or classical quotations), which are likely harder to solve. %in the absence of explicit retrieval or structured knowledge sources. 
A second group consists of inherently ambiguous clues, for which a unique answer may not exist in isolation. Such clues are typically disambiguated only when embedded within a crossword grid (e.g., generic clues such as \textit{``Città francese''} or \textit{``Nome d'uomo''}\footnote{Transl.\ ``French city'', ``male name''.}), a contextual constraint only available in Subtask~2.

\begin{figure}[t!]
    \centering
    \includegraphics[width=1.0\linewidth]{images/task_1_freqs-lens_plot.png}
    \caption{On the left (\textbf{a}), the plot shows the Acc@10 of each run across log-frequency bins along with the respective Pearson correlations ($r$); the gray dashed line marks the average answer length per bin. On the right (\textbf{b}), the plot shows the Acc@10 of each run across different answer lengths.}
    \label{fig:task_1_freqs-lens}
\end{figure}


\subsection{Subtask 2: Grid Solving} 
\label{sec:grid-solving}

Table~\ref{tab:cruciverb_joint} (\textbf{b}) reports the leaderboard of systems participating in Subtask~2. All systems achieved substantially higher results than the baseline thanks to a combination of stronger clue-answering experts and specifically tailored grid-solving algorithms. Both teams employed a similar pipeline, leveraging the systems developed for Subtask~1 to generate candidate answers for each clue, then applying a search algorithm to fill the grid while enforcing crossing constraints. As a consequence, no team exploited the training and validation datasets released for Subtask~2.

\begin{comment}
\begin{table*}[h!]
\centering
\scriptsize

\begin{minipage}[t]{0.47\textwidth}
\centering
\vspace*{0mm} % top align
\begin{tabular}{l|ccc}
\toprule
\textbf{Team} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\midrule
FFT-UniBa\_c1000\_1       & 0.92 & 0.85 & 0.34 \\
FFT-UniBa\_c1000NODICT\_1 & 0.92 & 0.85 & 0.32 \\
FFT-UniBa\_c1000\_2       & 0.93 & 0.84 & 0.28 \\
FFT-UniBa\_c1000NODICT\_2 & 0.91 & 0.82 & 0.22 \\
UNIBA RUN2                & 0.82 & 0.67 & 0.16 \\
UNIBA RUN1                & 0.82 & 0.66 & 0.16 \\
Baseline                  & 0.73 & 0.58 & 0.08 \\
\bottomrule
\end{tabular}
\vspace*{\fill}
\caption{Cruciverb-IT Subtask 2 leaderboard, FM ranked.}
\label{tab:task2_overall}
\end{minipage}
\hfill
\begin{minipage}[t]{0.47\textwidth}
\centering
\vspace*{0mm}
\begin{tabular}{l|cccc}
\toprule
\textbf{Team} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} \\
\midrule
FFT-UniBa\_c1000\_1        & 0.73 & 0.89 & 0.92 & 0.95 \\
FFT-UniBa\_c1000\_2        & 0.74 & 0.88 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_2  & 0.71 & 0.87 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_1  & 0.72 & 0.87 & 0.91 & 0.95 \\
UNIBA RUN2                 & 0.60 & 0.70 & 0.81 & 0.88 \\
UNIBA RUN1                 & 0.60 & 0.72 & 0.81 & 0.89 \\
Baseline                   & 0.53 & 0.68 & 0.74 & 0.75 \\
\bottomrule
\end{tabular}
\vspace*{\fill}
\caption{Character accuracy by number of surrounding \\ non-black cells (columns).}
\label{tab:intersection_table}
\end{minipage}

\end{table*}
\end{comment}

The FFT-UNIBA runs achieved by far the best results, with the top-ranked run reaching a character-level accuracy of 92\% and correctly solving 34\% of the grids in the test set. Given the higher performance obtained in Subtask~1, we hypothesize that the strength of the FFT-UNIBA clue-answering system plays a major role in the observed gap with UNIBA, since the clue-answering component acts as the main semantic bottleneck in the solving pipeline.
Moreover, their multiple-seed strategy with high-confidence ranking, backtracking, and a constrained depth-first search approach proved effective in mitigating potentially incorrect early placements and, overall, appears well suited to the combinatorial nature of crossword solving, suggesting that explicit constraint propagation is crucial for grid-level reasoning. 

\begin{table*}[t!]
\centering
\scriptsize
\resizebox{\textwidth}{!}{
\begin{tabular}{l|ccc|ccc|ccc|ccc|ccc}
\toprule
\multirow{2}{*}{\textbf{Systems}} & \multicolumn{3}{c|}{\textbf{5×5}}
& \multicolumn{3}{c|}{\textbf{7×7}}
& \multicolumn{3}{c|}{\textbf{9×9}}
& \multicolumn{3}{c|}{\textbf{11×11}}
& \multicolumn{3}{c}{\textbf{13×13}} \\
\cmidrule{2-16}
& \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\midrule
FFT-UniBa\_c1000\_2       
& 1.00 & 1.00 & 1.00 
& 0.94 & 0.88 & 0.60 
& 0.92 & 0.83 & 0.10 
& 0.91 & 0.82 & 0.00 
& 0.85 & 0.73 & 0.00 \\

FFT-UniBa\_c1000NODICT\_2
& 1.00 & 0.98 & 0.90 
& 0.95 & 0.90 & 0.60 
& 0.92 & 0.85 & 0.10 
& 0.91 & 0.81 & 0.00 
& 0.84 & 0.73 & 0.00 \\

FFT-UniBa\_c1000\_1      
& 0.98 & 0.94 & 0.70 
& 0.94 & 0.86 & 0.50 
& 0.94 & 0.86 & 0.20 
& 0.89 & 0.79 & 0.00 
& 0.87 & 0.76 & 0.00 \\

FFT-UniBa\_c1000NODICT\_1
& 0.97 & 0.90 & 0.70 
& 0.91 & 0.82 & 0.30 
& 0.93 & 0.84 & 0.10 
& 0.91 & 0.81 & 0.00 
& 0.85 & 0.74 & 0.00 \\

UNIBA RUN2               
& 0.89 & 0.77 & 0.40 
& 0.85 & 0.72 & 0.30 
& 0.83 & 0.68 & 0.10 
& 0.80 & 0.65 & 0.00 
& 0.70 & 0.52 & 0.00 \\


UNIBA RUN1             
& 0.86 & 0.72 & 0.40 
& 0.86 & 0.72 & 0.30 
& 0.85 & 0.69 & 0.10 
& 0.80 & 0.65 & 0.00 
& 0.71 & 0.52 & 0.00 \\

Baseline                  
& 0.85 & 0.71 & 0.40 
& 0.68 & 0.49 & 0.00 
& 0.74 & 0.62 & 0.00 
& 0.66 & 0.51 & 0.00 
& 0.73 & 0.59 & 0.00 \\

\bottomrule
\end{tabular}
}
\caption{Cruciverb-IT Subtask 2 results reported separately for each grid size (5×5, 7×7, 9×9, 11×11, 13×13).}
\label{tab:task2_by_size}
\end{table*}

A clear pattern emerges when analyzing performance across grid sizes in Table~\ref{tab:task2_by_size}: while the best systems achieve near-perfect accuracy on 5×5 grids (up to 100\% FM for the top-ranked run), performance degrades steeply as grid size increases, with the top-ranked run dropping from 100\% to 60\% FM when moving from 5×5 to 7×7 grids. No system achieved a complete solution on 11×11 or 13×13 grids, highlighting the exponential growth in complexity as the number of interdependent constraints increases. %The FFT-UniBa system, combining a fine-tuned answer generator with constraint satisfaction search, substantially outperforms the beam-search approach of the UNIBA system, suggesting that explicit constraint propagation is crucial for grid-level reasoning.
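
A back-of-the-envelope sketch helps to see why FM can collapse while CA and WA remain high. Assume, purely for illustration, a grid with $n$ entries, each filled correctly with the same probability $w$ and independently of the others; this independence assumption is a simplification that real grids violate, since crossing entries share letters. The symbols $n$, $w$ and $c_j$ (the number of candidates considered for entry $j$) are hypothetical and do not correspond to measured properties of the test grids.
\[
P(\mathrm{FM}) = w^{n}, \qquad |\Omega| = \prod_{j=1}^{n} c_j
\]
With $w = 0.85$, this gives $0.85^{10} \approx 0.20$ for a grid with 10 entries and $0.85^{40} \approx 0.0015$ for a grid with 40 entries. Likewise, the number of candidate assignments $|\Omega|$ grows as $c^{n}$ when every entry has $c$ candidates, which is why exhaustive search quickly becomes impractical without constraint propagation.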


\begin{comment}
    
\begin{table}
\centering
\scriptsize
%\resizebox{\linewidth}{!}{
\begin{tabular}{lcccc}
\toprule
 & \multicolumn{4}{c}{\textbf{Intersection number}} \\
\cmidrule(lr){2-5}
\textbf{System} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} \\
\midrule
FFT-UniBa\_c1000\_1               & 0.73 & 0.89 & 0.92 & 0.95 \\
FFT-UniBa\_c1000\_2               &  0.74 & 0.88 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_2         & 0.71 & 0.87 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_1         & 0.72 & 0.87 & 0.91 & 0.95 \\
UNIBA RUN2 & 0.60 & 0.70 & 0.81 & 0.88 \\
UNIBA RUN1 & 0.60 & 0.72 & 0.81 & 0.89 \\
\bottomrule
\end{tabular}

\caption{Character accuracy by the number of surrounding non-black cells.}
\label{tab:intersection_table}
\end{table}

\end{comment}


\begin{comment}


\begin{figure}[t!]
    \centering
    \includegraphics[width=0.5\linewidth]{images/stats_subtask2.png}
    \caption{WA accuracies of each run across answer lengths (number of characters); the gray dashed-line marks the average target frequency.}
    \label{fig:task_2_freqs-lens}
\end{figure}

\begin{table}[t!]
\scriptsize
    \centering
    \begin{tabular}{l|cccc}
\toprule
 & \multicolumn{4}{c}{\textit{\textbf{Intersection number}}} \\
\cmidrule(lr){2-5}
\textbf{System} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} \\
\midrule
FFT-UniBa\_c1000\_1        & 0.73 & 0.89 & 0.92 & 0.95 \\
FFT-UniBa\_c1000\_2        & 0.74 & 0.88 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_2  & 0.71 & 0.87 & 0.92 & 0.95 \\
FFT-UniBa\_c1000NODICT\_1  & 0.72 & 0.87 & 0.91 & 0.95 \\
UNIBA RUN2                 & 0.60 & 0.70 & 0.81 & 0.88 \\
UNIBA RUN1                 & 0.60 & 0.72 & 0.81 & 0.89 \\
Baseline                   & 0.53 & 0.68 & 0.74 & 0.75 \\
\bottomrule
\end{tabular}
\caption{Character accuracy by number of surrounding non-black cells (columns). Systems here are ranked by the overall CA.}
\label{tab:intersection_table}
\end{table}

\end{comment}

\begin{figure}[t!]
    \centering
    
    % ----------- LEFT: PLOT -----------
    \begin{minipage}[t]{0.48\linewidth}
        \vspace{0pt}
        \centering
        \includegraphics[width=\linewidth]{images/stats_subtask2.png}
        \caption{WA of each run across answer \\ 
        lengths (number of characters); the gray dashed line \\ 
        marks the average target frequency.}
        \label{fig:task_2_freqs-lens}
    \end{minipage}
    \hfill
    % ----------- RIGHT: TABLE -----------
    \begin{minipage}[t]{0.48\linewidth}
        \vspace{0pt}
        \centering
        \scriptsize
        \begin{tabular}{l|cccc}
        \toprule
         & \multicolumn{4}{c}{\textit{\textbf{Intersection number}}} \\
        \cmidrule(lr){2-5}
        \textbf{System} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} \\
        \midrule
        FFT-UniBa\_c1000\_1        & 0.73 & 0.89 & 0.92 & 0.95 \\
        FFT-UniBa\_c1000\_2        & 0.74 & 0.88 & 0.92 & 0.95 \\
        FFT-UniBa\_c1000NODICT\_2  & 0.71 & 0.87 & 0.92 & 0.95 \\
        FFT-UniBa\_c1000NODICT\_1  & 0.72 & 0.87 & 0.91 & 0.95 \\
        UNIBA RUN2                 & 0.60 & 0.70 & 0.81 & 0.88 \\
        UNIBA RUN1                 & 0.60 & 0.72 & 0.81 & 0.89 \\
        Baseline                   & 0.53 & 0.68 & 0.74 & 0.75 \\
        \bottomrule
        \end{tabular}
        \vspace{7.3mm}
        \captionof{table}{Character accuracy by number of \\ 
        surrounding non-black cells (columns). Systems \\ 
        here are ranked by the overall CA.}
        \label{tab:intersection_table}
    \end{minipage}
    
\end{figure}


\paragraph{Influencing Factors.} For Subtask~2, the influence of lexical and structural factors on system performance appears less pronounced than in Subtask~1. In particular, the relationship between word accuracy (WA) and answer length is weaker. As shown in Figure~\ref{fig:task_2_freqs-lens}, accuracy generally increases from answers of length 2 up to length 6, followed by a moderate decrease for longer answers, without the marked downward trend observed in Subtask~1. Overall, the drop in performance for longer words is noticeably less severe, and the curves across systems are smoother. A similar pattern emerges when considering answer frequency. While the average target frequency decreases across length bins, accuracy does not show a strong monotonic decline. Instead, performance initially increases and only decreases for the lowest-frequency bins, with a substantially milder effect compared to Subtask~1. A plausible explanation for these trends lies in the specific configuration of Subtask~2, which integrates a crossword grid solver into the prediction pipeline. In this setting, the final predictions do not solely reflect the behavior of the underlying neural models, but rather the interaction between the models and the solver. As a consequence, the solver may act as a filtering and re-ranking component, partially mitigating the impact of both lexical frequency and answer length, and thereby smoothing the correlations observed in Figure~\ref{fig:task_1_freqs-lens}. Moreover, the absence of a sharp performance drop for longer answers can be attributed to the presence of multiple grid constraints: although longer words are generally harder to predict in isolation, their placement within a crossword grid provides additional crossing letters and structural cues, which can facilitate their recovery with respect to Subtask~1.

\paragraph{The Role of Intersections.} Table~\ref{tab:intersection_table} provides insight into why constraint-based approaches succeed: character accuracy increases monotonically with the number of intersecting words per cell (from 0.73 to 0.95 for FFT-UniBa). Cells with more intersections benefit from additional constraints that help disambiguate among candidate answers, effectively allowing the system to cross-check predictions. This finding aligns with human solving strategies, where solvers often rely on crossing words to confirm or reject candidate answers. Focusing in particular on the FFT-UniBa system, its multiple configurations reveal that larger candidate pools generally improve Full Match scores by increasing the likelihood of including the correct answer in the search space. Dictionary augmentation also provides modest but consistent improvements, particularly for rare words that may not appear in the model's top predictions. We deem the trade-off between the gains from these approaches and the search and computational complexity an interesting topic for future research on efficient automatic crossword solvers.
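
To make the quantity reported in Table~\ref{tab:intersection_table} explicit, one possible formalization is sketched below. It assumes that each white cell $c$ of the test grids is assigned an intersection number $n(c) \in \{1, 2, 3, 4\}$ (the column index of the table), with $y_c$ the gold character and $\hat{y}_c$ the predicted one; the symbols $\mathcal{C}_j$, $n(c)$, $y_c$ and $\hat{y}_c$ are introduced here for illustration only.
\[
\mathrm{CA}(j) = \frac{|\{c \in \mathcal{C}_j : \hat{y}_c = y_c\}|}{|\mathcal{C}_j|}, \qquad \mathcal{C}_j = \{c : n(c) = j\}
\]
Read this way, the gain between the least and the most constrained cells is $\mathrm{CA}(4) - \mathrm{CA}(1) = 0.95 - 0.73 = 0.22$ for FFT-UniBa\_c1000\_1, $0.88 - 0.60 = 0.28$ for UNIBA RUN2, and $0.75 - 0.53 = 0.22$ for the Baseline.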

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\linewidth]{images/task_2_crosses.png}
    \caption{Example grids taken from the test set of Subtask~2. The grids (see Appendix~\ref{app:crossword-clues} for the corresponding clues) are populated with the correct answers and each cell is colored (green to red) by the fraction of systems that answered correctly (green cell = all systems predicted the correct character in that cell; red cell = no system predicted the correct character for that cell).}
    \label{fig:crosses-examples}
\end{figure}

Figure~\ref{fig:crosses-examples} shows three 13×13 crossword grids taken from the test set and populated with the gold solutions, with each cell colored by the fraction of systems that correctly predicted the corresponding character. These examples provide qualitative visual evidence supporting the trend reported in Table~\ref{tab:intersection_table}: systems tend to produce incorrect answers more frequently for isolated cells, particularly those corresponding to short entries or located near grid borders. For example, in crossword \textbf{(c)} of Figure~\ref{fig:crosses-examples}, no system was able to correctly fill numerous isolated cells, such as those for clues 9, 15, 27, 34, 36, 54, and 55, which often give rise to regions with a high density of errors. Similar patterns can also be observed in grids \textbf{(a)} and \textbf{(b)}. This common trend highlights the performance advantage of constraint-based approaches when structural redundancy is available for disambiguation, while also underscoring the importance of a robust clue-answering expert for cases with limited or no intersection hints.

% Gab: Discuss here the difference in scale between Unitor and FFT-Uniba models?

% \begin{tcolorbox}[font=\fon{ppl}\itshape,width=1\linewidth, colframe=gray, colback=blue!15!gray!15, boxsep=2mm, arc=3mm]
% text goes here 

% \smallskip

% \end{tcolorbox}

\subsection{Human and Artificial Solvers}
\label{sec:human_and_artificial_solvers}

Looking at the core challenges encountered by the models in the Cruciverb-IT tasks, we wondered whether such challenges are consistent with the difficulties perceived by human solvers. We also asked ourselves whether the automatically created grids are in line with grids created by humans, and how they would be received by human creators and solvers. While running a full study with human subjects was not feasible at this time (although it might be considered in future extensions of this work), a consultation with the Italian crossword expert Stefano Bartezzaghi yielded some interesting considerations, which we report below.

\paragraph{Length.}
At the level of the underlying models, longer solutions are generally harder to predict, with the exception of two-letter words, which often correspond to abbreviations, initials, or wordplay phenomena. This behaviour is consistent with the trends observed in Subtask~1, where predictions are produced in isolation. However, when considering the complete systems used in Subtask~2, i.e.\ the combination of neural models and the grid solver, the role of answer length becomes less clear-cut. In this setting, the negative impact of longer words is attenuated, and performance tends to increase up to medium-length answers, followed by only a mild decrease for longer ones. This suggests that the constraints imposed by the crossword grid and the re-ranking performed by the solver partially compensate for the intrinsic difficulty of predicting longer strings in isolation. In particular, longer answers benefit from a higher number of crossings, which provide additional character-level constraints and can facilitate their recovery within the grid. Interestingly, this behaviour brings artificial systems closer to human solving strategies. For human solvers, longer words can in fact be easier to retrieve, due to the larger number of constraints and to the presence of standard bound morphemes, such as \emph{-zione}, \emph{post-}, \emph{sub-}, \emph{-abile}, \emph{-mento}, which further restrict the set of plausible candidates and often make the solution more predictable. Moreover, longer and morphologically transparent words can help fill the grid more efficiently by constraining neighbouring entries. A similar distinction emerges for clue length. At the model level, longer clues tend to be more difficult to handle. 
From a human perspective, instead, difficulty is primarily determined by the degree of focus and ambiguity of the clue rather than by its length.\footnote{For instance, the solution \emph{Torino} can be clued with increasingly specific definitions, such as \emph{``città italiana''} (transl.\ ``Italian city''), \emph{``capoluogo di regione italiana''} (transl.\ ``capital of an Italian region''), and \emph{``capoluogo del Piemonte''} (transl.\ ``capital of Piedmont''), where more detailed, and thus often longer definitions progressively reduce ambiguity and are typically perceived as easier by human solvers.} In fact, longer clues are often more informative and less ambiguous than short and highly compact ones, and can therefore be perceived as easier to solve.
 
 
\paragraph{Frequency.}
At the level of the neural models, lexical frequency plays a major role: more frequent answers are generally more likely to be predicted correctly. This behaviour clearly emerges in Subtask~1, where systems operate without grid-level constraints. When considering the full systems employed in Subtask~2, however, the effect of frequency becomes weaker and less monotonic. Although frequent words still tend to be favoured, accuracy does not sharply decrease for low-frequency answers, and the overall trend appears substantially smoother. As in the case of answer length, this attenuation can be attributed to the presence of the grid solver, which integrates the predictions of the models with structural constraints derived from the crossword grid. As a result, the final output reflects the interaction between lexical preferences learned by the models and the combinatorial constraints enforced by the solver, rather than lexical frequency alone. This behaviour partially aligns artificial systems with human solving strategies. For human solvers, common words are in general easier to retrieve, but the specificity and ambiguity of the clue often play a more decisive role than raw lexical frequency. For instance, even a very frequent word such as \emph{``albero''} (transl.\ ``tree'') can become difficult to recover when the clue is vague (e.g.\ ``vegetation''), semantically ambiguous (e.g.\ ``Maestro di barca''), or relies on an idiosyncratic contextualization, that is, a clue grounded in a highly specific and unconventional frame rather than in a direct lexical relation (e.g.\ \emph{``La sequoia lo è del mammut''}\footnote{Transl.\ ``The sequoia is the mammoth’s one''. The clue relies on an implicit historical context: sequoias are extremely long-lived trees, and some specimens were already alive when mammoths still existed. The solution is therefore \emph{``albero''}, which can only be retrieved by reconstructing this implicit and unconventional contextual relation.}). 
In this respect, the solver-based setting of Subtask~2 reduces the dominance of frequency-driven behaviour observed at the model level, and yields a performance profile that is closer to the human perception of difficulty.



\paragraph{Grid features} Larger grids are very difficult for models: no system was able to solve the $11\times11$ and $13\times13$ crosswords in Cruciverb-IT. For human solvers, by contrast, grid size per se does not directly affect the complexity of the task. One aspect that does appear to be consistent across artificial and human solvers is cell isolation: the more neighbors a cell has, and therefore the more constraints, the easier it is to fill it with the correct character.


\paragraph{Do the grids feel ``human''?} More generally, we asked to what extent the generated grids resemble good examples of human-crafted crosswords and whether they come across as artificial. A gap does emerge: the grids we generated would generally not be particularly appealing to a human solver, since they lack what crossword experts would call \textit{rhythmic breadth}. This is because the generated grids do not offer elaborate crossings and form large \textit{squares} only to a very limited extent.


\section{Conclusion}
\label{sec:conclusion}

We presented Cruciverb-IT, the first shared task on Italian crossword solving, held at EVALITA 2026. The task attracted five participating teams who submitted a total of 17 system runs across two subtasks: clue answering and full grid solving. We released a dataset of approximately 410,000 Italian clue-answer pairs and 600 automatically generated crosswords of varying sizes, providing a new benchmark for evaluating language understanding and reasoning capabilities in Italian.

Our evaluation reveals that modern NLP approaches achieve promising results in individual-clue answering, with the best system reaching 69\% accuracy at rank 1 via retrieval-augmented LLM prompting. For grid solving, constraint-satisfaction methods combined with fine-tuned language models are most effective, achieving up to 92\% character accuracy and solving 34\% of grids completely. However, a clear scalability challenge emerges: while systems reliably solve smaller grids (5×5), the steep drop in performance on larger grids indicates that crossword solving at realistic scales remains an open problem.

In addition, the qualitative analysis reported in Section~\ref{sec:human_and_artificial_solvers}, based on a consultation with Italian crossword expert Stefano Bartezzaghi, highlights a systematic gap between human solving strategies and the behavior of current neural models, as well as a more nuanced picture when considering full crossword-solving systems. In particular, our observations show that factors such as morphological predictability, semantic focusing, ambiguity, and idiosyncratic contextualization of clues play a central role for human solvers. At the same time, model-level predictions are largely driven by surface-level properties such as answer length and lexical frequency, whereas system-level configurations that integrate a grid solver partially mitigate these effects by exploiting structural constraints induced by the crossword grid. This comparison suggests that future crossword-solving systems should explicitly account for higher-level linguistic and pragmatic properties of clues, and not only for grid-level constraints, in order to better approximate human strategies.

The findings suggest several directions for future research. First, hybrid architectures that combine the semantic understanding of Large Language Models with explicit constraint reasoning may be required to effectively improve grid-level performance, beyond solving individual clues. Second, the strong correlation between intersection density and accuracy suggests that iterative refinement strategies could be effective for progressively constraining the solution space. Finally, the resource and evaluation framework introduced in this task can serve as a blueprint for exploring crossword solving in other languages and for investigating the broader question of how language models handle linguistic puzzles at the intersection of cultural knowledge and logical reasoning.

\section*{Acknowledgments}
\label{sec:acknowledgments}

We would like to thank Stefano Bartezzaghi for carefully analyzing our results and observations and for providing valuable insights that greatly helped us better interpret the linguistic and pragmatic aspects of the task.
The authors acknowledge the support of the PNRR MUR project \href{https://fondazione-fair.it/}{PE0000013-FAIR}. Partial support was also received from the project “Understanding and Enhancing Preference Alignment in Large Language Models Through Controlled Text Generation” (IsCc8\_ALIGNLLM), funded by CINECA under the ISCRA initiative, for the availability of HPC resources and support.

%% The declaration on generative AI comes in effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}

During the preparation of this work, the authors used GPT-5 and Grammarly to check grammar and spelling. The authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

%%
%% Define the bibliography file to be used
\bibliography{sample-ceur}

\newpage

\appendix

\section{Influencing Factors}
\label{app:influencing-factors}

Figure~\ref{fig:in_training} shows the distribution of test instances across different system agreement levels, split according to whether the answer word appears in the training set. Table~\ref{tab:cluster} reports, for the clusters obtained by applying K-Means to the set of instances incorrectly predicted by all the systems, the top TF-IDF terms and some example clues.

\begin{figure}[h!]
    \centering
    \includegraphics[width=0.6\textwidth]{images/in_training.png}
    \caption{Distribution of test instances across system agreement levels, measured as the percentage of systems that correctly predicted the gold answer, split on whether the answer word appeared in the training set.}
    \label{fig:in_training}
\end{figure}
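
For concreteness, the agreement level used in Figure~\ref{fig:in_training} can be written as follows. This is a sketch under assumptions: the notation ($S$, $y_i$, $\hat{y}_{i,s}$, $\mathrm{agr}$) is introduced here only for illustration, and we assume that a system counts as correct on an instance when its top-ranked candidate matches the gold answer.
\[
\mathrm{agr}(i) = \frac{100}{|S|} \sum_{s \in S} \left[ \hat{y}_{i,s} = y_i \right]
\]
Here $S$ is the set of evaluated systems, $y_i$ is the gold answer of test instance $i$, $\hat{y}_{i,s}$ is the answer predicted by system $s$ for that instance, and the bracket equals $1$ when the condition holds and $0$ otherwise. Under this reading, $\mathrm{agr}(i) = 0$ identifies the instances incorrectly predicted by all systems, which are the ones clustered in Table~\ref{tab:cluster}.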

\begin{table}[h!]
\scriptsize
\setlength{\tabcolsep}{4pt}
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{c p{5cm} p{8cm}}
\toprule
\textbf{Cluster} & \textbf{Top terms}                                                                                                        & \textbf{Clue examples}                                                                                                                                                                            \\
\midrule
1       & italiana, centro, tipo, re, successo, gioco, personaggio, noto, esserlo, piccola                                  & La zona da cui nasce il Po; Malvagia; Si torna agli antichi; Tiranno... spagnolo; Condizione di chi ha due capi                                                                       \\
\midrule
2       & famoso, scrittore, cantante, televisivo, storico, italiano, successo, soneria, soneria personaggio, tipo          & Famoso dramma di Agatha Christie; Giornalista conduttore di un famoso Processo televisivo; Famoso scrittore; È famoso quello da Procida; Famoso cantante italiano                     \\
\midrule
3       & nome, femminile, dà, famosa, tipo, televisivo, successo, scrittore, soneria, soneria personaggio                  & Nome d'uomo; Altro nome dei ricci di mare; Nome di molti spagnoli, messicani e argentini; Altro nome del martin pescatore; Danno nome a una famosa sonata di Beethoven                \\
\midrule
4       & film, famoso, noto, personaggio, nome, televisivo, tipo, scrittore, soneria, soneria personaggio                  & Un famoso film a episodi; La musica che fa da sottofondo al film; È ``onorario'' quello d'un film con Richard Gere; Un film americano d'autore; Un noto film del regista Joseph Losey  \\
\midrule
5       & celebre, televisivo, tipo, storico, soneria personaggio, soneria, successo, scrittore, roma, provincia            
& È diventato celebre quello di notte; La casa editrice di un celebre dizionarista; Un celebre Scipione; Una celebre massima latina relativa alla salute; Un celebre idillio di Manzoni \\
\midrule
6       & francese, famoso, tipo, storico, soneria personaggio, televisivo, successo, scrittore %, soneria, roma
& Città francese; Pierre, regista cinematografico francese; La Svizzera... francese; Regione francese; Guillaume-Léon, politico francese                                                \\
\midrule
7       & donna, nome donna, nome, storico, tipo, televisivo, successo, scrittore, soneria, soneria %personaggio 
& Una donna generosa; Donna ricaduta nel peccato, per la teologia; Attraente come una donna; Donna che non ci vede più; Donna di casa                                              \\
\midrule
8       & provincia, comune provincia, comune, storico, tipo, televisivo, successo, scrittore, soneria, soneria personaggio & La località misteriosa di oggi (provincia di Bergamo); Comune in provincia di Salerno; Comune in provincia di Lecco; Comune in provincia di Verbania; Lombardi di provincia           \\
%\midrule
%9       & pianta, dà, tipo, storico, soneria personaggio, televisivo, successo, scrittore, soneria, provincia & Pianta erbacea che dà un energico purgante ; Una pianta officinale ; Pianta aromatica ; Lo svilupparsi di steli alla base di una pianta ; Pianta dal girovita abbondante                  \\
%\midrule
%10       & roma, nome, tipo, storico, soneria personaggio, televisivo, successo, scrittore, soneria, provincia & A Roma sono rozzi e emarginati ; Un anagramma di Roma ; Poiché sorge su sette colli, è detta "la Roma tedesca" ; Segue Ara Pacis... nel nome di un monumento di Roma ; È rientrata a Roma \\
\bottomrule
\end{tabular}
    \caption{Examples of clusters obtained by applying TF-IDF + K-Means (K = 10) to the set of instances incorrectly predicted by all systems. Each row reports representative clue examples and the corresponding top TF-IDF terms characterizing the cluster.}
    \label{tab:cluster}
\end{table}
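
As a reference for how clusters such as those in Table~\ref{tab:cluster} are typically obtained, the following is the textbook formulation of TF-IDF weighting and of the K-Means objective. It is a sketch under assumptions rather than a description of the exact pipeline: the symbols $N$, $\mathrm{tf}$, $\mathrm{df}$, $\mathbf{v}_c$, and $\mu_k$ are introduced here for illustration, and details such as smoothing, normalization, and the n-gram range are not specified here.
\[
\mathrm{tfidf}(t, c) = \mathrm{tf}(t, c) \cdot \log \frac{N}{\mathrm{df}(t)},
\qquad
\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{c \in C_k} \| \mathbf{v}_c - \mu_k \|^2 .
\]
Here $N$ is the number of clustered clues, $\mathrm{tf}(t, c)$ is the frequency of term $t$ in clue $c$, $\mathrm{df}(t)$ is the number of clues containing $t$, $\mathbf{v}_c$ is the TF-IDF vector of clue $c$, and $\mu_k$ is the centroid of cluster $C_k$. Under this reading, the top terms of a cluster would be those with the largest weights in $\mu_k$.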


\section{Crossword Clues}
\label{app:crossword-clues}

We report here the crossword clues associated with the grids discussed in Section~\ref{sec:grid-solving}.

\begin{table}[!htbp]
{\footnotesize
\begin{tabular}{l@{\hskip 25pt}p{0.5\linewidth}}
\textbf{ACROSS} & \textbf{DOWN} \\
4. Il secreto di certe ghiandole & 1. Assicurazione per chi guida \\
10. Una Pina del teatro & 2. Il primo nome di Volont\`e \\
12. L'equipaggio di una nave pirata & 3. Indaga sulla mafia \\
14. Centro di mira & 4. Iniziali dell'attore Rea \\
15. Tornante nel calcio & 5. Il rimpianto Moro (iniz.) \\
17. Fra capo e collo & 6. Divisione amministrativa svedese \\
19. Agnese a Toledo & 8. Si succedono nella vita \\
22. Il centro di Positano & 9. Leggevano in pubblico gli editti \\
24. Permette di fissare e modellare i capelli & 11. Le sue gesta sono narrate nei due libri dei Re della  \\
25. Quality Assurance & 18. Ugo Gregoretti \\
26. Quella di mezzo \`e preferibile & 20. Fu per anni la grande rivale della Navratilova \\
28. Le separa la O & 21. Comprano e vendono azioni \\
29. Nel cuore delle Ande & 22. Insito, connaturato \\
31. Si susseguono spaventosamente & 23. Tribunale Penale Internazionale \\
35. Metropolitana parigina & 25. Quantit\`a Massima \\
36. \_\_, come lava! diceva Calimero & 27. La Rodriguez, regina del fado \\
37. La si scrive per elogiare & 31. Fuoco francese \\
38. La misura che si prende per tutelarsi & 32. Una voce in fattura \\
42. Le aveva in testa Eva & 33. Il 19 responsabile della pandemia \\
43. Lo dice chi accetta & 34. L'attrice Di Benedetto \\
44. Una cintura per arti marziali & 35. Hanno un gustoso ripieno \\
45. Le consonanti dell'ubiquo & 38. Grosso pesce da tana \\
46. Precede... Tin-Tin & 39. La Cina gli \`e vicina \\
48. Coda di gusta & 40. In quel luogo per Livio \\
49. Quella fissa viene e non se ne va & 41. Un'abbreviazione per... aree \\
51. Lo dicono i Romani... negando & 45. Una ruota del Lotto \\
54. Tante erano le Caravelle di Colombo & 47. Le gemelle... di Ennio \\
56. Articolo per muratore & 50. Uno... starnuto \\
57. Mettere giorno, mese e anno & 52. Il nome del ministro Ronchi \\
62. Piccolo a Torino & 53. Il fumetto di un cane \\
63. Gli uccelli che possono essere cinerini & 55. Da tenera diventa avanzata \\
    & 58. Il Nobel che ha dato il premio e non l'ha ricevuto (iniz.) \\
    & 59. Segue erre ed esse \\
\end{tabular}
}
\caption{Definitions and solutions for Grid \textbf{(a)}.}
\end{table}

\begin{table}[!htbp]
{\footnotesize
\begin{tabular}{l@{\hskip 25pt}p{0.5\linewidth}}
\textbf{ACROSS} & \textbf{DOWN} \\
1. Si porta sulle spalle & 1. Suddividere la posta \\
6. Si mette nel sacco & 2. Si accorre per prestarlo \\
10. Sono pieni di quattrini & 3. Il nucleo del nucleo \\
14. Il centro del Friuli & 4. Noto passo appenninico \\
15. Abito di penitenza & 5. Precede le prime inserzioni dell'elenco \\
16. Hanno preceduto i CD & 6. La rima della fattoria \\
17. Non si toccano in Fort Knox & 7. Le lancia l'atterrito \\
19. L'...aula di Zenone & 8. Con ``tap'' forma un ballo \\
21. Alto vulcano siciliano & 11. Provincia del Lazio \\
23. Lo pu\`o subire il pugile & 12. Cos\`i \`e l'adesione acritica a dogmi \\
24. Il nome dell'attore Selleck & 13. C'\`e chi vorrebbe raddrizzar loro le gambe! \\
25. Si pu\`o far di presenza & 20. A questo punto... degli antichi vati \\
30. Un famoso re di Persia & 22. Prefisso per intelletto \\
32. Iniziali della Cruz & 25. Un po' acerbo \\
34. Canaletti veneziani & 26. Le iniziali di Colleoni \\
37. Ogni religione ha il suo & 33. La scuola con gli interni \\
38. Simbolo dell'erbio & 35. Si apprezza nell'umorista \\
40. Cola sulla leccarda & 37. A lui spettava il governo della Repubblica di Venezia \\
47. Il battesimo della nave & 39. Catanzaro \\
48. Fa gridare i tifosi & 40. Il frutto che si mangia a chicco a chicco \\
50. Una patria della canzone italiana popolare (targa) & 41. Periodo geologico \\
52. Lo indossa l'indiana & 42. Un risultato calcistico \\
53. Un piacere da moderare & 49. Creano costosi monili \\
54. Venite in centro & 52. Un artista molto noto \\
55. La coppia in lotta & 53. Il famoso Bj\"orn del tennis \\
56. Recita un credo ateo & 55. Pubblica guide (sigla) \\
58. Grasso della critica (iniz.) & 57. Global Trade Organization \\
60. Eventi disastrosi & 61. Si beve nel pomeriggio
\end{tabular}
}
\caption{Definitions and solutions for Grid \textbf{(b)}.}
\end{table}

\begin{table}[!htbp]
{\footnotesize
\begin{tabular}{l@{\hskip 25pt}p{0.5\linewidth}}
\textbf{ACROSS} & \textbf{DOWN} \\
1. Le hanno il custode e la guardia & 1. Ci sono da tavola \\
3. Sensitivo & 2. Scoraggiato, avvilito \\
9. Il fumetto di un cane & 3. Ruote da mulino \\
12. La pi\`u splendente stella della Lira & 5. Dilatazione delle cavit\`a cardiache \\
14. International Normalised Ratio (sigla) & 6. Gli spetta una provvigione per quel che ha... combinato \\
17. Molto magri, denutriti & 7. Uccello oceanico delle zone artiche \\
19. Era domani fino a ieri & 10. Ingrossamento \\
22. Capolavoro ricordato con l'Iliade & 13. Il Lerner dell'Infedele \\
23. La ``erre'' dei Greci & 15. Targa del Nord-Ovest \\
26. Nel centro di Montreal & 18. \`E bellissimo... in mezzo \\
27. Ha dato il nome a un ``vizio'' (che peraltro non aveva) & 20. Il sacro calice ricercato da Parsifal \\
31. Orribili Mostriciattoli Genetici & 28. In mezzo al cancan \\
33. Sportello di consulenza fiscale & 32. Si compra in edicola \\
34. Uno dei figli del biblico Caleb & 35. Una carta favorevole \\
36. Il pittore del blu & 41. Il titolo del deputato (abbreviazione) \\
38. La Hardin pianista & 43. Solerte e laborioso \\
40. Andato via & 51. Girano sui colli... \\
42. Il nome del ministro Ronchi & 52. Iniziali di Nuvolari \\
45. Porzione di territorio & 53. Li percorrono le gondole \\
47. Fanno di maggio un miraggio & 55. Preposizione articolata \\
50. Mezzo Zul\`u & 57. Gi\`a latino, sono inglese \\
51. Verbo... canoro & \\
54. La neo... dal fiocco rosa & \\
56. Manicaretti siciliani & \\
60. Indice della ricchezza nazionale & \\
61. L'attrice Tushingham & \\
62. Societ\`a... in breve & \\
64. Dix che non dipinge & \\
66. Pari in forma & \\
\end{tabular}
}
\caption{Definitions and solutions for Grid \textbf{(c)}.}
\end{table}


\end{document}

%%
%% End of file
