%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout.
%% hf: enable header and footer.
\documentclass[
twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Code listings support (listings package; no Pygments required)
\usepackage{listings}
\usepackage{url}
\usepackage{enumitem}
%% auto break lines
\lstset{breaklines=true}

\newenvironment{tasklist}{
  \begin{itemize}[label={}, leftmargin=0pt, itemindent=0pt]
  \setlength{\parindent}{1em}
  \renewcommand\item{\par\hangindent=1em\hangafter=1\noindent}
}{
  \end{itemize}
}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26--27, Bari, IT}

%%
%% The "title" command
\title{Cruciverb-IT @ EVALITA 2026: Overview of the Crossword Solving in Italian Task}


%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.



\author[1,2]{Cristiano Ciaccio}[%
orcid=0009-0001-6113-4761,
email=cristiano.ciaccio@ilc.cnr.it,
]
\address[1]{Department of Computer Science, University of Pisa, Italy}
\address[2]{Institute for Computational Linguistics ``A. Zampolli'' (CNR-ILC) - ItaliaNLP Lab, Pisa, Italy}

\author[3]{Gabriele Sarti}[%
orcid=0000-0001-8715-2987,
email=g.sarti@rug.nl,
]
\address[3]{Center for Language and Cognition (CLCG), University of Groningen, The Netherlands}

\author[2]{Alessio Miaschi}[%
orcid=0000-0002-0736-5411,
email=alessio.miaschi@ilc.cnr.it,
]

\author[2]{Felice Dell'Orletta}[%
orcid=0000-0003-3454-9387,
email=felice.dellorletta@ilc.cnr.it,
]

\author[3]{Malvina Nissim}[%
orcid=0000-0001-5289-0971,
email=m.nissim@rug.nl,
]

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
  This paper presents an overview of Cruciverb-IT, the shared task on
  crossword solving in Italian organized at EVALITA 2026. The task
  comprises two subtasks: answering individual crossword clues given
  the character length of the target answer, and autonomously filling
  complete crossword grids. Five teams took part, submitting 16 runs
  in total (10 for subtask 1 and 6 for subtask 2).
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  NLP \sep
  Crossword Solving \sep
  Evaluation \sep
  Italian
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle

\section{Introduction}
\label{sec:intro}

Language games have emerged as valuable testbeds for evaluating and enhancing the reasoning abilities of Language Models (LMs). Among these, crossword puzzles represent a particularly challenging and multifaceted task that requires not only linguistic competence but also cultural knowledge, lateral thinking, and the ability to interpret ambiguous or polysemous clues~\cite{english-crossword,cryptic-crosswords,saha-etal-2025-language,sadallah-etal-2025-makes}. Solving crosswords involves complex semantic and pragmatic reasoning, making this setting ideal for testing models’ deeper language understanding capabilities beyond surface-level similarity.

Before the advent of modern LMs, most approaches to crossword solving relied on retrieval-based methods and shallow lexical and semantic features~\cite{webcrow,italian-crossword-solver}. For example, \cite{barlacchi2014retrieval} proposed a system that exploited lexical resources and similarity metrics to match clues to candidate answers in Italian, while SACRY~\cite{moschitti-etal-2015-sacry} incorporated syntactic information and ranking strategies to improve clue-answer matching. However, these systems typically struggle with clues that require deeper interpretative reasoning, such as wordplay, anagrams, or polysemous expressions. Consider, for instance, the clue ``Producono con procedimenti lenti'', where \textit{lenti} can mean both ``slow'' and ``lenses'' in Italian; a viable answer could be \textit{ottici} (opticians), illustrating the type of ambiguity traditional systems often fail to resolve.

Despite the impressive advancements in Large Language Models (LLMs), their performance on language games such as crosswords remains limited, especially in morphologically rich and less-resourced languages like Italian~\cite{sarti-etal-2024-non,sarti-etal-2024-eurekarebus,ciaccio2025crosswords}. Existing LMs and retrieval-based systems still fall short when faced with clues requiring subtle reasoning or cultural grounding.

Building on this line of research, the Cruciverb-IT task organized at EVALITA 2026~\cite{evalita2026overview} represents the first shared task specifically dedicated to crossword solving. The initiative was designed to encourage research in this direction by providing a challenging testbed for developing and evaluating systems on crossword puzzle solving.

\section{Definition of the Task}
\label{sec:task}

The Cruciverb-IT shared task is organized into two subtasks:

\paragraph{Subtask 1:} The first subtask consists of answering clues extracted from Italian crosswords. Specifically, the task is framed as a question-answering problem: participants are presented with a set of clues $C = \{c_1, c_2, \dots, c_n\}$ and are asked to build a system that, for a given clue $c_i$, produces one or more candidate solutions $\hat{S}_i = \{\hat{s}_1, \hat{s}_2, \dots, \hat{s}_m\}$, possibly containing the correct answer $s_i$. To simulate a more realistic crossword-solving scenario and to further guide the systems towards the correct answer space, each clue $c_i$ is paired with the character length of the target answer $s_i$. For example, given the clue and target character length \textit{Sono un fiore di straordinaria bellezza, 4}, a system should produce a list of one or more candidates, e.g. \textit{[iris, \textbf{rosa}, rose, yuzu, fior, ...]}, possibly containing the correct answer \textit{\textbf{rosa}}.
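
For reference, candidate lists of this kind are commonly scored with accuracy at $k$ and mean reciprocal rank, the measures reported in Table~\ref{tab:task1_results}. The following is a sketch that assumes their standard definitions and treats the candidate list as ranked; the notation $r_i$ is introduced here only for illustration. Let $r_i$ be the 1-based position of the correct answer $s_i$ in the list returned for clue $c_i$, with $r_i = \infty$ (and $1/r_i = 0$) when $s_i$ is absent. Then
\[
  \mathrm{Acc@}k = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[r_i \le k],
  \qquad
  \mathrm{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i},
\]
where $\mathbf{1}[\cdot]$ is $1$ if the condition holds and $0$ otherwise. In the example above, \textit{rosa} appears in second position, so under these definitions the clue would contribute $0$ to Acc@1, $1$ to Acc@10, and $1/2$ to MRR.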

\paragraph{Subtask 2:} The second subtask consists of autonomously solving Italian crossword grids. Participants are presented with a set of empty crossword grids $G = \{\mathbf{G}_1, \mathbf{G}_2, \dots, \mathbf{G}_k\}$, where each grid $\mathbf{G}_i$ is paired with a list of clues, each annotated with the $(x, y)$ coordinates of the square where the corresponding solution starts and with its direction, either down (\textit{verticale}) or across (\textit{orizzontale}). Each grid $\mathbf{G}_i$ is an $n \times n$ matrix of squares, each of which is either blank or black. The developed systems should autonomously fill the grid with the appropriate solutions, yielding a fully or partially filled crossword grid that keeps the characters of crossing words consistent and maximizes the number of solutions correctly placed in the grid.
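
The consistency requirement can be made explicit with a small formalization. This is only a sketch: the symbols $\ell_j$, $p_j$, and the 0-based character index $t$ are illustrative, and we assume that $x$ indexes columns and $y$ indexes rows. Let clue $j$ start at $(x_j, y_j)$ and let $\hat{s}_j$ be the answer of length $\ell_j$ placed for it. Its $t$-th character, $t = 0, \dots, \ell_j - 1$, occupies the square $p_j(t) = (x_j + t,\; y_j)$ for an across clue and $p_j(t) = (x_j,\; y_j + t)$ for a down clue. A filled grid is then consistent when, for every pair of clues $j$ and $k$,
\[
  p_j(t) = p_k(u) \;\Longrightarrow\; \hat{s}_j[t] = \hat{s}_k[u],
\]
that is, two answers sharing a square must agree on the character written in it.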

\section{Dataset}
\label{sec:dataset}

\section{Evaluation}
\label{sec:evaluation}

\section{Submitted Systems and Participants}
\label{sec:systems}

\begin{table*}[t!]
    \centering
    \begin{tabular}{lrlrrr}
    \hline
    \textbf{Team} & \textbf{Members} & \textbf{Affiliation} & \textbf{Subtasks} & \textbf{Runs T1} & \textbf{Runs T2} \\
    \hline
    AC/DG & 4 & Politecnico di Torino & 1 & 2 & - \\
    FFT-UniBa & 5 & Università degli Studi di Bari Aldo Moro & 2 & 4 & 4 \\
    MINDS & 1 & Politecnico di Torino & 1 & 1 & - \\
    UNIBA & 1 & Università degli Studi di Bari Aldo Moro & 2 & 2 & 2 \\
    Unitor & 2 & Università degli Studi di Roma Tor Vergata; Reveal Srl & 1 & 1 & - \\
    \hline
    \end{tabular}
    \caption{Teams participating in the EVALITA 2026 Cruciverb-IT shared task. For each team, we report the number of members, the affiliation, the number of subtasks entered, and the number of submitted runs per subtask (T1 and T2).}
    \label{tab:submitted_systems}
\end{table*}

Following a call for interest, five teams registered for the task and submitted their predictions, for a total of 16 runs (10 for subtask 1 and 6 for subtask 2). As shown in Table~\ref{tab:submitted_systems}, three teams (AC/DG, MINDS, and Unitor) participated only in subtask 1, while FFT-UniBa and UNIBA participated in both subtasks.

\paragraph{AC/DG} \cite{ac-dg}

\paragraph{FFT-UniBa} \cite{fft-uniba} 

\paragraph{MINDS} \cite{minds}

\paragraph{UNIBA} \cite{uniba}

\paragraph{Unitor} \cite{unitor}

\section{Results}
\label{sec:results}

\begin{table}[t!]
\centering
\begin{tabular}{lrrr}
\hline
\textbf{Team}                      & \textbf{Acc@1} & \textbf{Acc@10} & \textbf{MRR}  \\
\hline
Unitor                    & 0.69  & 0.83   & 0.72 \\
FFT-UniBa\_Constrained1   & 0.58  & 0.75   & 0.63 \\
MINDS                     & 0.59  & 0.71   & 0.62 \\
FFT-UniBa\_Constrained2   & 0.57  & 0.75   & 0.62 \\
FFT-UniBa\_Unconstrained1 & 0.55  & 0.72   & 0.60 \\
FFT-UniBa\_Unconstrained2 & 0.54  & 0.73   & 0.60 \\
AC/DG\_Embeddings         & 0.51  & 0.73   & 0.57 \\
AC/DG\_BM25               & 0.47  & 0.67   & 0.53 \\
UNIBA\_Run1               & 0.43  & 0.59   & 0.47 \\
%FFT-UniBa\_1              & 0.40  & 0.63   & 0.46 \\
Baseline                  & 0.40  & 0.62   & 0.46 \\
UNIBA\_Run2               & 0.36  & 0.54   & 0.41 \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 1 leaderboard. Runs are ranked by MRR.}
    \label{tab:task1_results}
\end{table}

\begin{comment}
\begin{table}[t!]
\centering
\begin{tabular}{lrrr}
\hline
\textbf{Team}                         & \textbf{Char Acc.} & \textbf{Word Acc.} & \textbf{Full Match} \\
\hline
FFT-UniBa\_c1000\_1       & 0.92      & 0.85      & 0.34       \\
FFT-UniBa\_c1000NODICT\_1 & 0.92      & 0.85      & 0.32       \\
FFT-UniBa\_c100\_1        & 0.93      & 0.86      & 0.32       \\
FFT-UniBa\_c100\_2        & 0.93      & 0.85      & 0.30       \\
FFT-UniBa\_c100NODICT\_1  & 0.92      & 0.84      & 0.28       \\
FFT-UniBa\_c1000\_2       & 0.93      & 0.84      & 0.28       \\
FFT-UniBa\_c10\_1         & 0.92      & 0.84      & 0.26       \\
FFT-UniBa\_c100NODICT\_2  & 0.91      & 0.82      & 0.24       \\
FFT-UniBa\_c1000NODICT\_2 & 0.91      & 0.82      & 0.22       \\
FFT-UniBa\_c10\_2         & 0.91      & 0.82      & 0.22       \\
FFT-UniBa\_c10NODICT\_1   & 0.90      & 0.80      & 0.20       \\
FFT-UniBa\_c10NODICT\_2   & 0.89      & 0.79      & 0.18       \\
UNIBA\_Run1                  & 0.82      & 0.66      & 0.16       \\
UNIBA\_Run2                  & 0.82      & 0.67      & 0.16       \\
Baseline                     & 0.73      & 0.58      & 0.08      \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 2 leaderboard.}
\label{tab:task2_results}
\end{table}
\end{comment}

\begin{table*}[t!]
\centering
\scriptsize
\begin{tabular}{l|lll|lll|lll|lll|lll|lll}
\hline
                          & \multicolumn{3}{|c|}{\textbf{Overall}}                      & \multicolumn{3}{c|}{\textbf{$5\times5$}}            & \multicolumn{3}{c|}{\textbf{$7\times7$}}            & \multicolumn{3}{c|}{\textbf{$9\times9$}}            & \multicolumn{3}{c|}{\textbf{$11\times11$}}          & \multicolumn{3}{c}{\textbf{$13\times13$}}          \\
                          \hline
\textbf{Team}                      & \textbf{CA}       & \textbf{WA}         & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} & \textbf{CA} & \textbf{WA} & \textbf{FM} \\
\hline
FFT-UniBa\_c1000\_1       & 0.92            & 0.85              & 0.34       & 1.00      & 1.00      & 1.00       & 0.94      & 0.88      & 0.60       & 0.92      & 0.83      & 0.10       & 0.91      & 0.82      & 0.00       & 0.85      & 0.73      & 0.00       \\
FFT-UniBa\_c1000NODICT\_1 & 0.92            & 0.85              & 0.32       & 1.00      & 0.98      & 0.90       & 0.95      & 0.90      & 0.60       & 0.92      & 0.85      & 0.10       & 0.91      & 0.81      & 0.00       & 0.84      & 0.73      & 0.00       \\
%FFT-UniBa\_c100\_1        & 0.93            & 0.86              & 0.32       & 1.00      & 1.00      & 1.00       & 0.95      & 0.89      & 0.50       & 0.93      & 0.85      & 0.10       & 0.91      & 0.81      & 0.00       & 0.86      & 0.74      & 0.00       \\
%FFT-UniBa\_c100\_2        & 0.93            & 0.85              & 0.30       & 0.99      & 0.97      & 0.80       & 0.94      & 0.86      & 0.50       & 0.94      & 0.85      & 0.20       & 0.90      & 0.79      & 0.00       & 0.87      & 0.76      & 0.00       \\
%FFT-UniBa\_c100NODICT\_1  & 0.92            & 0.84              & 0.28       & 1.00      & 0.98      & 0.90       & 0.94      & 0.87      & 0.40       & 0.93      & 0.86      & 0.10       & 0.88      & 0.76      & 0.00       & 0.85      & 0.73      & 0.00       \\
FFT-UniBa\_c1000\_2       & 0.93            & 0.84              & 0.28       & 0.98      & 0.94      & 0.70       & 0.94      & 0.86      & 0.50       & 0.94      & 0.86      & 0.20       & 0.89      & 0.79      & 0.00       & 0.87      & 0.76      & 0.00       \\
%FFT-UniBa\_c10\_1         & 0.92            & 0.84              & 0.26       & 0.98      & 0.94      & 0.70       & 0.94      & 0.87      & 0.50       & 0.94      & 0.85      & 0.10       & 0.89      & 0.78      & 0.00       & 0.86      & 0.75      & 0.00       \\
%FFT-UniBa\_c100NODICT\_2  & 0.91            & 0.82              & 0.24       & 0.96      & 0.90      & 0.70       & 0.90      & 0.81      & 0.30       & 0.93      & 0.86      & 0.20       & 0.89      & 0.77      & 0.00       & 0.85      & 0.75      & 0.00       \\
FFT-UniBa\_c1000NODICT\_2 & 0.91            & 0.82              & 0.22       & 0.97      & 0.90      & 0.70       & 0.91      & 0.82      & 0.30       & 0.93      & 0.84      & 0.10       & 0.91      & 0.81      & 0.00       & 0.85      & 0.74      & 0.00       \\
%FFT-UniBa\_c10\_2         & 0.91            & 0.82              & 0.22       & 0.97      & 0.92      & 0.70       & 0.94      & 0.87      & 0.40       & 0.92      & 0.82      & 0.00       & 0.89      & 0.80      & 0.00       & 0.83      & 0.72      & 0.00       \\
%FFT-UniBa\_c10NODICT\_1   & 0.90            & 0.80              & 0.20       & 0.95      & 0.87      & 0.60       & 0.92      & 0.84      & 0.30       & 0.91      & 0.82      & 0.10       & 0.86      & 0.74      & 0.00       & 0.84      & 0.73      & 0.00       \\
%FFT-UniBa\_c10NODICT\_2   & 0.89            & 0.79              & 0.18       & 0.93      & 0.85      & 0.60       & 0.92      & 0.84      & 0.30       & 0.90      & 0.83      & 0.00       & 0.86      & 0.73      & 0.00       & 0.82      & 0.70      & 0.00       \\
UNIBA\_Run1               & 0.82            & 0.66              & 0.16       & 0.86      & 0.72      & 0.40       & 0.86      & 0.72      & 0.30       & 0.85      & 0.69      & 0.10       & 0.80      & 0.65      & 0.00       & 0.71      & 0.52      & 0.00       \\
UNIBA\_Run2               & 0.82            & 0.67              & 0.16       & 0.89      & 0.77      & 0.40       & 0.85      & 0.72      & 0.30       & 0.83      & 0.68      & 0.10       & 0.80      & 0.65      & 0.00       & 0.70      & 0.52      & 0.00       \\
Baseline                  & 0.73 & 0.58 & 0.08       &  0.85         &   0.71        &     0.40       &     0.68      &   0.49        &    0.00        &     0.74      &   0.62        &    0.00        &     0.66      &      0.51     &     0.00       &     0.73      &     0.59      &    0.00   \\
\hline
\end{tabular}
\caption{Cruciverb-IT Subtask 2 leaderboard, reporting character accuracy (CA), word accuracy (WA), and full match (FM). Systems are ranked by overall FM; results are reported both globally (across all crossword grids) and separately for each grid size ($5\times5$, $7\times7$, $9\times9$, $11\times11$, $13\times13$).}
\label{tab:task2_results_all}
\end{table*}
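
As a worked check of how the overall and per-size columns of Table~\ref{tab:task2_results_all} relate, assume that FM denotes the fraction of grids solved without any error and that every grid size contributes the same number of test grids. The second assumption is consistent with the reported figures but is not stated in the table. Under these assumptions the overall FM is the mean of the five per-size values. For FFT-UniBa\_c1000\_1,
\[
  \mathrm{FM}_{\mathrm{overall}} = \frac{1.00 + 0.60 + 0.10 + 0.00 + 0.00}{5} = 0.34,
\]
which matches the reported overall score. The same computation for the Baseline gives $(0.40 + 0.00 + 0.00 + 0.00 + 0.00)/5 = 0.08$, again in agreement with the table.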

\begin{table}
\centering
\small
\begin{tabular}{lcccc}
\toprule
 & \multicolumn{4}{c}{\textbf{Intersection number}} \\
\cmidrule(lr){2-5}
\textbf{System} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} \\
\midrule
FFT-UniBa::task2\_final23\_cand1000.txt               & 0.73 & 0.89 & 0.92 & 0.95 \\
FFT-UniBa::task2\_final22\_cand1000.txt               & 0.74 & 0.88 & 0.92 & 0.95 \\
FFT-UniBa::task2\_final22\_cand1000NODICT.txt         & 0.71 & 0.87 & 0.92 & 0.95 \\
FFT-UniBa::task2\_final23\_cand1000NODICT.txt         & 0.72 & 0.87 & 0.91 & 0.95 \\
UNIBA::T2\_RUN1\_uniba\_task2\_beam\_NS\_test\_TF.out & 0.60 & 0.72 & 0.81 & 0.89 \\
UNIBA::T2\_RUN2\_uniba\_task2\_beam\_NS\_test\_TT.out & 0.60 & 0.70 & 0.81 & 0.88 \\
\bottomrule
\end{tabular}
\caption{Character accuracy by the number of intersecting words per cell.}
\label{tab:intersection_table}
\end{table}

\section{Discussion}
\label{sec:discussion}

\section{Conclusion}
\label{sec:conclusion}

\section*{Acknowledgments}
\label{sec:acknowledgments}




%%
%% Define the bibliography file to be used
\bibliography{sample-ceur}

%%
%% If your work has an appendix, this is the place to put it.
\appendix

\end{document}

%%
%% End of file
