%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Minted listings support 
%% Need pygment <http://pygments.org/> <http://pypi.python.org/pypi/Pygments>
\usepackage{listings}
\usepackage{minted}
\usepackage{cleveref}
\usepackage[table]{xcolor}
\usepackage{tabularx}
\usepackage{diagbox}
%% auto break lines
\lstset{breaklines=true}
\definecolor{eggshell}{RGB}{254,252,246}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{INDAQA2 - 
A Large Italian Narrative QA Benchmark:~\\
A CALAMITA 2026 Challenge}
%\tnotemark[1]
%\tnotetext[1]{You can use this document as the template for preparing your publication. We recommend using the latest version of the ceurart style.}

%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1]{Luca Gioffré}[%
orcid=0009-0007-9705-6797,
email=gioffre@diag.uniroma1.it,
url=https://lukfre.github.io/,
]
\cormark[1]
%\fnmark[1]

\author[1]{Luca Moroni}[%
orcid=0009-0006-1210-5098,
email=moroni@diag.uniroma1.it,
]


\author[1,2]{Alberte Fernández-Castro}[%
%orcid=,
email=castro@diag.uniroma1.it,
]

\author[1]{Elena Marafatto}[%
%orcid=,
email=marafatto@diag.uniroma1.it,
]

\author[1]{Giacomo Garufi}[
email=garufi.1750327@studenti.uniroma1.it
]

\author[1,2]{Roberto Navigli}[%
orcid=0000-0003-3831-9706,
email=navigli@diag.uniroma1.it,
]


\address[1]{Sapienza NLP Group, Dip. di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome, Italy}
\address[2]{Babelscape, Rome, Italy}


%% Footnotes
\cortext[1]{Corresponding author.}
%\fntext[1]{These authors contributed equally.}

\begin{abstract}
  Long-context comprehension and reasoning remain largely underexplored in the evaluation of Italian Large Language Models (LLMs). 
    Existing Italian benchmarks primarily focus on short or medium-length inputs, offering limited insight into models' ability to process extended narratives. 
    To address this gap, we introduce INDAQA2, a substantially revised and expanded version of INDAQA, a benchmark for narrative question answering on original Italian literary texts. 
    The new version comprises an expanded corpus of 461 total books, introduces a multiple-choice question answering format alongside the original open-ended tasks, and features manually curated texts drawn exclusively from works originally written in Italian, thus avoiding artifacts introduced by translation. 
    The benchmark evaluates long-context understanding over complete books of up to 250K tokens, testing complementary comprehension skills through a dual-structure design: global narrative understanding, assessed via questions derived from book summaries, and local precision, assessed via questions grounded in specific passages and entity-level details. 
    By supporting both open-ended and multiple-choice question answering formats, INDAQA2 enables evaluation of both generative capabilities and discriminative reasoning, facilitating comprehensive and scalable comparison across models.
    Our evaluation of several Italian-specialized and multilingual models reveals significant performance disparities across task formats and highlights limitations in how current Italian models utilize extended contexts. 
  %This resource provides the first systematic assessment of long-context narrative reasoning for Italian LLMs, grounded in authentic Italian sources and free from translation-induced biases.
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  Narratives \sep
  Question-Answering \sep
  Long-context \sep
  Evaluation \sep
  Benchmark
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Challenge: Introduction and Motivation}
In recent years, an increasing number of newly released Large Language Models (LLMs) have been explicitly designed to process long inputs~\citep{Su2021rope,peng2024yarn}. Context windows have expanded rapidly, from early models limited to a few thousand tokens~\citep{radford2019language} to systems such as Llama 3.1~\citep{grattafiori2024llama3herdmodels}, supporting up to 128K tokens, and Qwen 2.5~\citep{Yang2025Qwen251MTR} and Gemini 1.5~\citep{geminiteam2024gemini15unlockingmultimodal}, which extend this capacity to the order of one million tokens. While these advances are impressive, the ability to handle long contexts introduces new and non-trivial challenges for model evaluation.

In parallel with the development of techniques to extend context length, there has been a surge in benchmarks targeting long-context capabilities. Prominent examples include Needle-in-a-Haystack–style tests~\citep{needle-in-haystack,kuratov2024search} and their long-context extension BABILong~\citep{NEURIPS2024_babilong}, which primarily assess retrieval and robustness under increasing input length. Beyond simple retrieval, benchmarks such as $\infty$Bench~\citep{zhang2024inftybenchextendinglongcontext} extend evaluation to contexts exceeding 100K tokens across diverse domains, while RULER~\citep{hsieh2024ruler} introduces multi-hop tracing and aggregation tasks to test deeper reasoning capabilities.
However, most of these benchmarks are strongly English-centric, with only a limited number of multilingual efforts (such as BABILong-ITA~\citep{tamburini-2025-babilong}), and they rarely focus on deep, narrative-level reasoning.

Despite recent efforts to advance generative evaluation for Italian~\citep{moroni-etal-2024-towards,magnini-etal-2025-leaderboard,puccetti-etal-2025-invalsi,nissim2025challengingabilitieslargelanguage}, a language spoken by over 60 million native speakers and supported by a rich literary tradition, comprehensive benchmarks for evaluating both retrieval and reasoning in long-document abstractive question answering remain largely absent~\citep{moroni-etal-2025-sustainable}.
The absence of such resources hinders the development and evaluation of Italian language models on tasks requiring extended discourse comprehension, a capability increasingly important for applications ranging from document analysis to educational tools and accessibility technologies.

To the best of our knowledge, there exist only two datasets for narrative QA in Italian, but each has significant limitations. 
FairytaleQA-IT~\citep{Leite_2024}, a machine translation into Italian of the FairytaleQA benchmark~\citep{xu-etal-2022-fantastic}, is limited to children's fairy tales and may contain translation artifacts that do not reflect natural Italian language use. 
INDAQA~\citep{moroni-etal-2025-learned}, the first Italian long-context QA benchmark on narratives, adopts the NarrativeQA methodology of generating questions from human-written summaries rather than directly from source texts. While this approach ensures question quality, it may inadvertently constrain question types to those answerable from compressed versions of narratives, rather than questions requiring deep engagement with the full text. 
Moreover, the benchmark lacks a systematic quality check of the source texts, and by supporting only open-ended generative tasks, it requires a carefully tailored evaluation setup, both important factors for building a fair (narrative) benchmark~\citep{bonomo-etal-2025-literaryqa,tedeschi-etal-2023-whats}.
Supporting the multiple-choice task would allow for a more straightforward evaluation, despite its own structural problems~\citep{molfese-etal-2025-right}.


To bridge this gap, we introduce an expanded and substantially refined version of INDAQA, which we name \textbf{INDAQA2}. 
The new benchmark addresses the aforementioned limitations and extends the dataset with an additional collection of 99 long texts paired with (open-ended and multiple-choice) question–answer items generated under different strategies, complementing the challenges and evaluation principles outlined in the CALAMITA initiative~\citep{nissim2025challengingabilitieslargelanguage}.
%This work is conceived as part of the CALAMITA initiative to develop a dynamic and community-driven benchmark for evaluating LLM capabilities in Italian.
As in the original benchmark, all source texts are original Italian novels or theatrical scripts, manually collected and curated. 
Importantly, the newly added works consist primarily of lesser-known literary texts, reducing the likelihood that models have memorized substantial amounts of content from their training data and thus mitigating contamination effects.
Moreover, all texts were originally written in Italian, which avoids artifacts introduced by translation. 
In some cases, the texts also include regional varieties and dialects (e.g., works written by \textit{Goldoni} or \textit{Pirandello}), further increasing linguistic diversity. 
Together, these properties make the benchmark a more reliable and realistic test bed for evaluating long-context comprehension and reasoning in Italian LLMs.
% add what your expectations are regarding model performance, and why.
For the aforementioned reasons, we expect models to perform rather poorly on this new challenge, mainly due to the length of the inputs. 

\section{Challenge: Description}
This challenge focuses on open-ended (OE) and multiple-choice (MC) QA over long narrative texts.
It therefore comprises two tasks, which differ in the expected model output: free-form generation for OE and a choice label for MC.

We focus on narratives for two main reasons.
First, \textbf{many narrative works are long by design}, so we can avoid constructing a synthetic benchmark by stitching together documents from different sources, as done in RULER~\citep{hsieh2024ruler} and BABILong~\citep{NEURIPS2024_babilong}. %, since books and theatrical scripts are homogeneous and self-contained.
The second reason is that \textbf{the narrative domain represents a critical challenge} for natural language understanding systems~\citep{bonomo-etal-2025-literaryqa}. 
Unlike factoid QA tasks over short passages~\citep{joshi-etal-2017-triviaqa,yang-etal-2018-hotpotqa}, the narrative QA task requires models to maintain coherence across thousands of tokens, track character relationships, understand causal chains, and reason about plot developments. 

All QA items in the benchmark have been automatically generated either from the summary of the story, following the style of NarrativeQA~\citep{kočiský2017narrativeqareadingcomprehensionchallenge}, or from individual or grouped excerpts (\Cref{sec:data_description}). 

\paragraph{INDAQA2 - OEQA task}
The model is prompted with a simple instruction, followed by the entire content of the book and the question.
Then, the model is expected to generate a concise answer, which will be evaluated against a set of reference answers.
The number of reference answers, which ranges from 1 to 5, depends on the question: those whose correct answer admits several formulations (i.e., paraphrases) have more references, so that generated answers are scored fairly.
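
As a minimal sketch of scoring against multiple references, the snippet below takes the best score over the reference set, so that any acceptable formulation can receive full credit. The per-reference metric (token-level F1 over whitespace tokens) and the function names \texttt{token\_f1} and \texttt{score\_open\_ended} are assumptions made for illustration only; they are not the official scorer of the challenge.

\begin{lstlisting}[language=Python]
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (assumed metric, illustration only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_open_ended(prediction: str, references: list[str]) -> float:
    """Score a generated answer as its best match over the 1 to 5 references."""
    return max(token_f1(prediction, ref) for ref in references)
\end{lstlisting}

Taking the maximum, rather than the mean, means that adding a further valid paraphrase to the reference set can never lower the score of an answer.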

%insert example
\setlength{\fboxsep}{6pt}

% \begin{figure}[h]
% \begin{center}
% \noindent\fbox{\colorbox{eggshell}{%
%   \centering
%   \begin{minipage}{0.93\textwidth}
%   \ttfamily
%   \obeyspaces\obeylines
%   49\_la\_coscienza\_di\_zeno.local\_question.42\\
%   \\
%   Chi è Carla per Zeno?\\
%   - È la sua amante\\
%   - È la donna con cui ha una relazione extraconiugale. 
%   \end{minipage}
%     }
% }
% \end{center}
% \caption{Example of an Open-ended QA item from the book ``La coscienza di Zeno''. \textbf{Both reference answers are correct}.}
% \label{fig:open-ended_example}
% \end{figure}


\paragraph{INDAQA2 - MCQA task}
The model is prompted with a simple instruction, followed by the entire content of the book, the question, and four answer choices (three distractors and the correct answer). 
The choices are already shuffled, so that the correct answer can be found in any position with equal probability.
The model is expected to identify the correct answer and return the letter (i.e., the label) corresponding to the option deemed correct.
Since the correct answer is given as one of the four options, in this setting there is no need for a set of reference answers.
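
The construction of an MC item and its evaluation can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names \texttt{build\_mc\_item} and \texttt{accuracy}, the \texttt{"A. text"} rendering of the options, and the use of plain accuracy as the metric are hypothetical and not taken from the benchmark code.

\begin{lstlisting}[language=Python]
import random


def build_mc_item(correct: str, distractors: list[str],
                  rng: random.Random) -> dict:
    """Shuffle one correct answer with three distractors and label them A-D."""
    if len(distractors) != 3:
        raise ValueError("expected exactly three distractors")
    options = [correct, *distractors]
    rng.shuffle(options)  # uniform permutation: every position equally likely
    labels = "ABCD"
    choices = [f"{label}. {text}" for label, text in zip(labels, options)]
    return {"choices": choices, "label": labels[options.index(correct)]}


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of items whose predicted letter equals the gold letter."""
    hits = sum(p.strip().upper() == g for p, g in zip(predicted, gold))
    return hits / len(gold)
\end{lstlisting}

With four options and uniform shuffling, a model that guesses at random is expected to reach an accuracy of 25\%.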

\begin{figure}[h]
\centering

\fbox{%
\begin{minipage}{0.9\textwidth}
\centering

\begin{minipage}{0.48\textwidth}
\centering
\colorbox{eggshell}{%
  \begin{minipage}{0.93\textwidth}
  \ttfamily
  \obeyspaces\obeylines
  la\_coscienza\_di\_zeno.oe.42\\
  \\
  Chi è Carla per Zeno?\\
  - È la sua amante\\
  - È la donna con cui ha una relazione extraconiugale.\\
  \end{minipage}
}
\end{minipage}
\hfill\vrule\hfill
\begin{minipage}{0.48\textwidth}
\centering
\colorbox{eggshell}{%
  \begin{minipage}{0.93\textwidth}
  \ttfamily
  \obeyspaces\obeylines
  la\_coscienza\_di\_zeno.mc.42\\
  \\
  Chi è Carla per Zeno?\\
  A. È un'amica di sua moglie\\
  B. È una parente\\
  \textbf{C. È la sua amante}\\
  D. È la sua cameriera
  \end{minipage}
}
\end{minipage}

\end{minipage}
}

\caption{Example of a QA item from \emph{La coscienza di Zeno}. On the left is the OE item, for which both reference answers are correct. On the right is the MC item, for which only one of the four options is correct (marked in \textbf{bold}).}
\label{fig:open-ended-side-by-side}
\end{figure}


% \begin{figure}[h]
% \begin{center}
% \noindent\fbox{\colorbox{eggshell}{%
%   \centering
%   \begin{minipage}{0.93\textwidth}
%   \ttfamily
%   49\_la\_coscienza\_di\_zeno.local\_question.42\\
%   \\
%   Chi è Carla per Zeno?\\
%   A. È un'amica di Augusta, la moglie di Zeno\\
%   B. È una parente\\
%   \textbf{C. È la sua amante}\\
%   D. È una cameriera che lavora a casa di Zeno
%   \end{minipage}
%     }
% }
% \end{center}
% \caption{Example of a MCQA item from the book ``La coscienza di Zeno''. Only one of the four options is correct (marked in \textbf{bold}).}
% \label{fig:multi-choice_example}
% \end{figure}


\section{Data description}\label{sec:data_description}

% % goal: dato dei libri, avere un certo numero di QA. Since it is unfeasible, we generate and refine them, following a similar apporach to ...
% % we select x books from y sources (PG, WIki, LiberLiber, see section)
% % Each book has a corresponding wikipedia page, which in many cases contains a summary/plot section, which we use to generate QA items 
% % as LLM model we decide to use Gemini

% INDAQA2 contains 461 books collected from Project Gutenberg, Wikisource, and LiberLiber (\Cref{sec:data_origin}). 
% Since manually writing questions, reference answers and distractors (i.e., QA items) at book scale would be unfeasible, we use an LLM\footnote{\texttt{gemini-2.5-flash}} to generate (and refine) them as in~\citep{moroni-etal-2025-learned}. 
% %Each book has associated a set of QA items generated using LLMs (i.e., the associated questions and reference answers), and, 
% Depending on the context we provide to the LLM to ground the generation of the QA items (i.e., a summary or a passage), we divide the books into two data splits.
% Books in the \textbf{Summary-level} data split have their QA items generated from the summaries (\Cref{sec:summary-level_data}), while the books in the \textbf{Passage-level} split have their QA items generated from individual passages, as they have no available summaries (\Cref{sec:passage-level_data}).
% We also determine the appropriateness of the generated data by annotating a sample (\Cref{sec:annotation}).

% We describe the data format of the benchmark in~\Cref{sec:data_format} and provide the inference prompts used throughout the challenge in~\Cref{sec:inference_prompt} (prompts used to generate and refine the QA items are reported in~\Cref{appendix:prompts}).
% Detailed statistics on the resulting dataset are provided in~\Cref{sec:stats} and in~\Cref{appendix:stats}. 

Our goal is to create a large-scale question-answering benchmark grounded in Italian literary texts.
Building upon the 362 books from the original INDAQA, we extend the collection with 99 additional works, resulting in a total of 461 books sourced from Project Gutenberg, Wikisource, and LiberLiber (\Cref{sec:data_origin}).
We also collect introductory sections as well as summary or plot sections from the corresponding Wikipedia pages when available.
However, manually writing questions, reference answers, and distractors (collectively, QA items) at book scale is infeasible.
Therefore, we rely on an LLM to generate and refine QA items, following an approach similar to that of~\citep{moroni-etal-2025-learned}.

Our QA generation is grounded in different types of textual context, which leads to two data splits. 
In the \textbf{Summary-level} split, QA items are generated from book summaries available on the corresponding Wikipedia pages (\Cref{sec:summary-level_data}). 
In the \textbf{Passage-level} split, QA items are instead generated from individual book passages, as no summaries are available for these works (\Cref{sec:passage-level_data}).

To assess the quality and appropriateness of the generated data, we conduct a manual annotation study on a representative sample of QA items (\Cref{sec:annotation}).

The data format and the inference prompts used in our evaluation are described in~\Cref{sec:data_format,sec:inference_prompt}. 
We provide detailed statistics in~\Cref{sec:stats} and~\Cref{appendix:stats}, while the prompts used for QA generation and refinement are available in~\Cref{appendix:prompts}.

\subsection{Origin of data}\label{sec:data_origin}
All documents in the benchmark were downloaded from Project Gutenberg, Wikisource, or LiberLiber\footnote{\texttt{https://www.gutenberg.org/}, \texttt{https://it.wikisource.org/}, \texttt{https://liberliber.it/}}.
We also use Wikipedia to extract summaries and metadata (when available, see~\Cref{appendix:metadata}).  

\textbf{Project Gutenberg} is a volunteer-driven initiative dedicated to digitizing and preserving cultural works and to encouraging the creation and distribution of e-books. Founded in 1971, it is the oldest existing digital library. Its collection consists mainly of complete books or individual texts in the public domain. All materials are freely accessible and provided in open, non-proprietary formats compatible with nearly all computing platforms.

\textbf{Wikisource} is a project of the Wikimedia Foundation that aims to build a freely accessible online library of original source texts and their translations in any language. Unlike Project Gutenberg, Wikisource documents can be collaboratively edited and reviewed by contributors, allowing continuous improvement, verification against original sources, and transparent version tracking. As a result, the quality and reliability of texts generally increase over time.

\textbf{LiberLiber} is an Italian non-profit project focused on promoting free culture through the publication of literary, musical, and scholarly works in the public domain or released under free licences. Its digital library places great emphasis on textual accuracy, careful proofreading, and scholarly reliability. Often, documents hosted on LiberLiber are considered to be of higher editorial quality than those available on Project Gutenberg.

\subsection{Summary-level data split}\label{sec:summary-level_data}

\paragraph{Source texts}
The Summary-level data split contains 362 books for which a summary or plot section is available on Wikipedia, corresponding to the original INDAQA dataset~\citep{moroni-etal-2025-learned}. 
During our analysis, we found that several books contained incomplete or corrupted content.
For instance, many theatrical scripts were missing character names in dialogues, while others exhibited missing or duplicated sections.
These issues were likely caused by failures of the HTML parser when applied to heterogeneous, crowdsourced texts that lack a standardized structure.
Indeed, it is difficult to devise a single parsing method that handles all the different HTML layouts, as a parser that works well for one document can produce corrupted output for others.
To address this issue, we manually downloaded and reviewed the source texts for all documents, ensuring high quality across the dataset\footnote{We downloaded the texts directly in \texttt{.txt} or \texttt{.rtf} format, avoiding HTML pages.}.

\paragraph{QA item generation}
Since the QA items were generated from the summaries, which we found to be of high quality, and manually validated in~\citep{moroni-etal-2025-learned}, we do not modify them. 
\newline

\noindent 
The resulting Summary-level section of INDAQA2 contains the original summaries, questions and answers, with clean source books for each QA item.
\Cref{fig:summary-level_sample} in~\Cref{appendix:examples} shows an example of a Summary-level QA item as a \verb|JSON| object.


\subsection{Passage-level data split}\label{sec:passage-level_data}
\paragraph{Source texts}
While valuable, the Summary-level data split mainly accounts for questions either about the whole narrative (\textit{abstractive}) or about information for which the answer can be found in various passages.
To test the ability of the models to retrieve more localized and specific details, we extend the original 362 documents with a collection of 99 additional books that were discarded from the previous INDAQA release because they lacked a summary.
These books were downloaded from the same sources and processed following the same procedure as the Summary-level split.

\paragraph{QA item generation}
Since these books lack a summary, we devised three methods for generating QA items with an LLM starting from individual passages.
These methods categorize the questions into three sets:
\begin{enumerate}
    \item \textbf{Local}: questions generated from a single passage (defined as 20 contiguous sentences) randomly selected at runtime (prompt in~\Cref{fig:generation_prompt_local-level}).
    These questions typically focus on specific details stated in the provided passage.
    
    \item \textbf{Local (alternative)}: questions generated from a single passage plus the previously generated \textit{Local} items (prompt in~\Cref{fig:generation_prompt_local-level-alt}). 
    We noticed that this generation setting encourages the model to avoid repetition and generate less straightforward questions, shifting the distribution of question types (see~\Cref{fig:first_words}).
    
    \item \textbf{Entity}: questions generated from three passages in which an entity consistently appears, selected from the beginning, middle, and ending sections of the documents (prompt in~\Cref{fig:entity_question_prompt}). Entities are identified by extracting all capitalized names (excluding \textit{stopwords}) from the questions in the two previously defined sets, and are then clustered by exact string match.
    These capitalized terms capture recurring entities such as characters, locations, or organizations that are central to the narrative.
    For each entity cluster, we select three passages where the entity appears. 
    Entities with fewer than three occurrences or very low frequency are filtered out.
    The resulting clusters are manually validated to remove noise (e.g., common words erroneously capitalized, non-entities).
    Finally, the selected passages and their associated local questions are provided to the LLM to generate questions targeting overarching plot elements, character development, or thematic connections across the narrative.
\end{enumerate}
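
The passage sampling and entity selection steps described above can be sketched as follows. This is a simplified illustration, not the actual generation pipeline: the toy stopword list, the function names, the substring test used to decide whether an entity appears in a passage, and the rule of dropping entities that do not occur in all three thirds of the book are assumptions made for the example.

\begin{lstlisting}[language=Python]
import random
import re
from typing import Optional

PASSAGE_LEN = 20  # sentences per passage, as stated in the text
# Toy stopword list (assumption): capitalized words that are not entities.
STOPWORDS = {"Il", "La", "Chi", "Che", "Cosa", "Come", "Dove", "Quando"}


def sample_passage(sentences: list[str], rng: random.Random) -> list[str]:
    """Return PASSAGE_LEN contiguous sentences starting at a random offset."""
    start = rng.randrange(max(1, len(sentences) - PASSAGE_LEN + 1))
    return sentences[start:start + PASSAGE_LEN]


def extract_entities(questions: list[str]) -> set[str]:
    """Collect capitalized words from the questions, skipping stopwords."""
    entities = set()
    for question in questions:
        for word in re.findall(r"\w+", question):
            if word[0].isupper() and word not in STOPWORDS:
                entities.add(word)  # a set merges exact string matches
    return entities


def pick_three_passages(entity: str,
                        passages: list[str]) -> Optional[list[int]]:
    """Pick one passage mentioning the entity from each third of the book."""
    n = len(passages)
    thirds = [range(0, n // 3), range(n // 3, 2 * n // 3),
              range(2 * n // 3, n)]
    picked = []
    for third in thirds:
        hits = [i for i in third if entity in passages[i]]
        if not hits:
            return None  # assumed filter: entity must span the whole book
        picked.append(hits[0])
    return picked
\end{lstlisting}

In this sketch the manual validation of the entity clusters is not modeled; it would take place between \texttt{extract\_entities} and \texttt{pick\_three\_passages}.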

\paragraph{MCQA conversion}
Due to their localized scope (i.e., a single passage), \textit{Local} and \textit{Local (alternative)} QA items are well suited to a multiple-choice format.
For each item in these two sets, we provide an LLM with the source passage and the generated question with reference answers, and instruct it to produce three plausible distractors, preferably grounded in the given context (prompt in~\Cref{fig:mc_conversion_prompt}).
We also tried to convert the \textit{Summary} and \textit{Entity} questions to a multiple-choice format.
However, we found that generating plausible distractors from summaries or multiple distant passages resulted in hallucinations or generally weak, low-quality distractors. 
Hence, only the two \textit{Local} sets support the MCQA format.

\paragraph{QA item refinement}
Following the original methodology for the INDAQA dataset, we refine the \textit{Local} QA items by asking an LLM to assess whether the questions are well posed and answerable, and whether the corresponding answers are acceptable.
We exclude \textit{Entity} questions from this correction step because their higher complexity could not be reliably handled by the tested LLMs.
\Cref{fig:correction_prompt} shows the prompt we used for this task, which we built following guidelines similar to those of~\citep{bonomo-etal-2025-literaryqa}.
The refinement process flagged 49 problematic QA items: given this low number, we manually reviewed the LLM corrections and replaced the original items with their refined versions in the benchmark. 
\newline

\noindent 
Throughout our experiments, we used Gemini-2.5-Flash~\citep{comanici2025gemini25pushingfrontier} to generate, refine and convert QA items.
An example QA item for each question set can be found in~\Cref{appendix:examples},~\Cref{fig:Passage-level_sample_local,fig:Passage-level_sample_alt,fig:Passage-level_sample_entity}.


\subsection{Annotation details}\label{sec:annotation}
To assess the quality and validity of the generated QA items, we conduct human validation on a representative sample of the corpus.
We focus on the newly added documents\footnote{The QA items generated for the original INDAQA have an error rate of 2.32\%, which we deem acceptable.}, following annotation guidelines similar to those of the original INDAQA~\citep{moroni-etal-2025-learned}. 

\paragraph{Subset Selection}
The Passage-level split comprises 11,560 QA items across 99 books, of which 10,187 support both free-form and multiple-choice evaluation (the two \textit{Local} question sets), while the remainder consists of only open-ended questions (the \textit{Entity} question set). 
We target approximately 5\% of these items for human annotation. 
We adopt a stratified sampling strategy to select the books in the annotation set.
Books are divided into 20 equal-probability bins based on text length quantiles, and from each bin, the book whose length is closest to that bin's mean length is selected (\Cref{fig:annotation_set}).
We use 20 bins (and thus 20 books) to keep the annotation effort manageable while maintaining diversity.
This approach ensures representative coverage across the entire range of text lengths, avoiding potential biases. 
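
The book selection procedure can be sketched as follows. The function name \texttt{select\_books} and the mapping from book identifiers to text lengths are hypothetical, and equal-count slices of the length-sorted list are used here as an approximation of the quantile bins; the actual implementation may differ.

\begin{lstlisting}[language=Python]
def select_books(lengths: dict[str, int], n_bins: int = 20) -> list[str]:
    """Pick, from each length-quantile bin, the book closest to the bin mean."""
    ranked = sorted(lengths, key=lengths.get)  # ids by increasing length
    selected = []
    for b in range(n_bins):
        # Equal-count slices of the sorted list approximate quantile bins.
        lo = b * len(ranked) // n_bins
        hi = (b + 1) * len(ranked) // n_bins
        bin_ids = ranked[lo:hi]
        if not bin_ids:
            continue  # fewer books than bins
        mean_len = sum(lengths[i] for i in bin_ids) / len(bin_ids)
        closest = min(bin_ids, key=lambda i: abs(lengths[i] - mean_len))
        selected.append(closest)
    return selected
\end{lstlisting}

With the 99 books of the Passage-level split and 20 bins, each bin contains four or five books, and exactly one book is selected per bin.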

 
Then, we randomly sample QA items from the 20 selected books so as to mirror the distribution of the three question sets:
400 items from the \textit{Local} set (20 per book),
120 items from the \textit{Local (alternative)} set (6 per book), and
60 items from the \textit{Entity} set (3 per book),
yielding a total of 580 items to review.

To measure inter-annotator agreement (IAA), we randomly select 100 overlapping items from the annotation subset (\textasciitilde17\%).
This overlap also mirrors the overall QA item distribution: approximately 70\% from the \textit{Local} set, 20\% from the \textit{Local (alternative)} set, and 10\% from the \textit{Entity} set. 
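
As a quick consistency check of the figures reported above (all numbers are taken from the text; only the arithmetic is added), the size of the annotation sample and its share of the Passage-level split are
\[
20 \times (20 + 6 + 3) = 400 + 120 + 60 = 580,
\qquad
\frac{580}{11{,}560} \approx 5.0\%,
\]
while the composition of the sample and the size of the overlap used for IAA are
\[
\frac{400}{580} \approx 69.0\%,
\quad
\frac{120}{580} \approx 20.7\%,
\quad
\frac{60}{580} \approx 10.3\%,
\qquad
\frac{100}{580} \approx 17.2\%.
\]
These values match the 5\% annotation target and the approximate 70/20/10 distribution of the three question sets.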

\paragraph{Annotation Guidelines}
The items in the annotation sample were independently validated by two expert annotators (native or proficient speakers of Italian).
Annotators were asked to assess the quality of automatically generated QA items for Italian long-context narratives. 
For each item, annotators evaluated the following dimensions:

\begin{itemize}
    \item \textbf{Fluency}: Whether the question, correct answers, and distractors (if any) are grammatically correct and naturally phrased in Italian.
    \item \textbf{Validity}: Whether the item elements are appropriate and accurate:
    \begin{itemize}
        \item The question is clear and answerable given the source text
        \item The reference answers are factually correct
        \item The distractors are plausible but incorrect (if present)
    \end{itemize}
\end{itemize}

Annotators were encouraged to note any unusual patterns, ambiguities, or issues worthy of further investigation.

\paragraph{Result} 
After the annotation phase, we observed high inter-annotator agreement, with a Cohen's kappa of 0.7563. 
The average error rate in the dataset, defined as the proportion of non-acceptable items among the annotated sample, is 4.74\%.
From the annotation process, we observed that the generated questions were always well-posed and answerable from the provided context alone. 
The errors mostly concerned the reference answers, in particular the second reference, which was sometimes inaccurate or wrong, whereas the first was always correct.
Regarding the distractors, they were generally plausible, albeit sometimes weak.
We discuss the implications of these quality observations for our evaluation methodology in~\Cref{sec:limitations}.
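
For reference, Cohen's kappa corrects the observed agreement for the agreement expected by chance:
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
where $p_o$ is the proportion of overlapping items on which the two annotators assign the same label, and $p_e$ is the agreement expected by chance given each annotator's label frequencies. A value of $\kappa = 1$ indicates perfect agreement, while $\kappa = 0$ indicates agreement no better than chance.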

\subsection{Data format}\label{sec:data_format}
The benchmark is freely available through the Hugging Face repository\footnote{\texttt{https://huggingface.co/datasets/sapienzanlp/INDAQA\_CALAMITA}}. 
All items share the same data fields, but, depending on the data split and question set, some fields may be empty.
%The schema is the following:
\small
\begin{itemize}
  \item \textbf{id} (\texttt{str}) : unique identifier for the document
  \item \textbf{text} (\texttt{str}) : text of the document
  \item \textbf{qas} (\texttt{list[dict]}) : QA entries associated with the document
  \begin{itemize}
    \item \textbf{question\_id} (\texttt{str}) : unique ID for the QA item
    \item \textbf{question} (\texttt{str}) : the question text
    \item \textbf{answers} (\texttt{list}) : list of free-form reference answers
    \item \textbf{choices} (\texttt{list}) : list of MCQA options
    \item \textbf{target} (\texttt{dict}) :
    \begin{itemize}
      \item \textbf{label} (\texttt{str}) : correct MCQ label (\verb|'A'|, \verb|'B'|, \verb|'C'|, or \verb|'D'|) 
      \item \textbf{text} (\texttt{str}) : canonical correct answer (i.e., the first reference)
    \end{itemize}
    \item \textbf{entity} (\texttt{str}) : entity targeted by the question, if present
    \item \textbf{model} (\texttt{str}) : generator model used
    \item \textbf{kind} (\texttt{str}) : question type (\textit{Local}, \textit{Local Alternative} or \textit{Entity})
    \item \textbf{source\_paragraphs\_ids} (\texttt{list}) : list of paragraph indices used to generate the QA
    \item \textbf{source\_questions\_ids} (\texttt{list}) : list of related question indices
  \end{itemize}
  \item \textbf{metadata} (\texttt{dict}) : book-level metadata
  \begin{itemize}
    \item \textbf{title} (\texttt{str}) : title of the work
    \item \textbf{author} (\texttt{str}) : author name
    \item \textbf{year} (\texttt{int}) : publication year
    %\item \textbf{genres} \texttt{list[str]} : main literary genres
    %\item \textbf{subgenres} \texttt{list[str]} : granular genre tags
    \item \textbf{summary} (\texttt{str}) : book summary used in summary\_level
    \item \textbf{summary\_length} (\texttt{int}) : length of the summary (in words)
    \item \textbf{text\_length} (\texttt{int}) : length of text (in words).
    \item \textbf{source\_link} (\texttt{str}) : link to the text source
    \item \textbf{summary\_link} (\texttt{str}) : link to the summary source
    \item \textbf{qa\_paragraphs} (\texttt{list[str]}) : list of text chunks used to generate the QAs
  \end{itemize}
\end{itemize}
\normalsize

We also report one example per data split and question set in~\Cref{appendix:examples} (\Cref{fig:summary-level_sample,fig:Passage-level_sample_local,fig:Passage-level_sample_alt,fig:Passage-level_sample_entity}).


% \subsection{Data statistics}
% \paragraph{Summary-level set}
% The summary-level set comprises 13,661 open-ended QA items, spanning over 362 documents, with an average of \textasciitilde38 QA items per book.
% The length of the documents goes from a minimum of \textasciitilde500 words to a maximum of \textasciitilde242K words, with an average of 26K \pm 33K.
% Due to the cleaning process of the original INDAQA dataset, the new dataset resulted in a smaller size in terms of words.
% %Due to the cleaning process of the original INDAQA dataset, the new dataset resulted in a smaller size (in terms of words), eliminating a total of \textasciitilde550K words.

% \paragraph{Passage-level set}
% The Passage-level set comprises 13,661 open-ended QA items, spanning over 99 documents.
% Due to the generation process, the majority of QA items in this set belong to the Local Question set, which on average has 80 \pm 14 items.
% Then, the Alternative Local Question set, with 23 \pm 5 QA items, and finally the smallest set, the Entity Question set, with 14 \pm 6 QA items.
% Of these sets, only the first two supports MCQA, as generating plausible distractors for this kind of questions proved very difficult.
% The length of the documents goes from a minimum of \textasciitilde8K words to a maximum of \textasciitilde188K words, with an average of 58K \pm 31K.

\subsection{Example of prompts used for zero- and/or few-shot settings}\label{sec:inference_prompt}
The challenge is designed to be carried out in a zero-shot setting with a very simple prompt.
\Cref{fig:inference_prompt} shows the prompt used for inference (in the generative task, \texttt{choices\_block} is left empty).

\subsection{Detailed data statistics}\label{sec:stats}

Tables~\ref{tab:dataset_stats} and~\ref{tab:qa_type_stats} report statistics for the Summary- and Passage-level splits of the INDAQA2 dataset.
Specifically, Table~\ref{tab:dataset_stats} presents document-level statistics divided by data split, including the number of documents, the average number of QA items per document, and text length. Table~\ref{tab:qa_type_stats}, instead, summarizes the distribution of question types (Summary, Local, Local alternative, and Entity) across the dataset. Figure~\ref{fig:doc_len} plots the document length distribution of both data splits, and Figure~\ref{fig:first_words} plots the distribution of question-initial words across question types.
Further statistics are presented in~\Cref{appendix:stats}. 

\paragraph{Summary-level}
This split comprises 13,661 open-ended QA items spanning 362 documents, averaging approximately 38 QA items per book. 
Due to the cleaning process applied to the original INDAQA dataset, document lengths were reduced by about 3\% while preserving all narrative content, from an average of 27K $\pm$ 38K words to an average of 26K $\pm$ 33K words.
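As a rough check of this figure, assuming the unrounded averages are 26,607 words before cleaning and 25,891 words after, the relative reduction is
\[
\frac{26{,}607 - 25{,}891}{26{,}607} = \frac{716}{26{,}607} \approx 0.027,
\]
i.e., roughly 2.7\%, which rounds to the 3\% reported above.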
%(from 26,607 $\pm$ 38,174 words to 25,891 $\pm$ 33,033 words).

\paragraph{Passage-level}
This split comprises 11,560 open-ended QA items spanning 99 documents. 
Although it contains fewer documents, its higher density of QA items per document yields a total comparable to that of the Summary-level split. 
Notably, Passage-level documents are on average more than twice as long as Summary-level ones, requiring models to process and reason over more extensive textual contexts.
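The per-document densities can be recovered from the totals in Table~\ref{tab:dataset_stats}, since the mean number of QA items per document equals the total number of items divided by the number of documents:
\[
\frac{13{,}661}{362} \approx 37.7, \qquad \frac{11{,}560}{99} \approx 116.8,
\]
so the Passage-level split is roughly three times as dense. Likewise, using the rounded average lengths from the same table, the length ratio is approximately $58\mathrm{K}/26\mathrm{K} \approx 2.2$.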

\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{@{\hspace{0.5cm}\extracolsep{\fill}}lccc@{\hspace{0.5cm}}}
\toprule
\textbf{Metric} & \textbf{Summary-level set} & \textbf{Passage-level set} & \textbf{Both sets}\\
\midrule
\# Documents      & 362 & 99 & 461 \\
\# QA items     & 13,661 & 11,560 & 25,221 \\
\# QA items/doc & 38 $\pm$ 2 & 117 $\pm$ 20 & 55 $\pm$ 34 \\
\midrule
%\textit{Document length (words)} & &  \\
Text length average & 26K $\pm$ 33K & 58K $\pm$ 31K & 33K $\pm$ 35K \\
Text length range   & 0.5K\texttt{-}242K &  8K\texttt{-}188K & 0.5K\texttt{-}242K \\
\bottomrule
\end{tabular*}
\caption{
    Statistics for the documents in INDAQA2 divided by data split. 
    }
\label{tab:dataset_stats}
\end{table}

\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{@{\hspace{0.5cm}\extracolsep{\fill}}lcccc@{\hspace{0.5cm}}}
\toprule
\textbf{Question Type} & \textbf{\# Items} & \textbf{\# Items/doc} & \textbf{Question length} & \textbf{Answer length} \\
\midrule
\multicolumn{4}{l}{\ \ \ \ \textit{Summary-level split}} \\
\quad Summary &  13,661  (100\%) &   38 $\pm$ ~~2 & ~~7 $\pm$ 2 & ~~5 $\pm$ 3 \\
\midrule
\multicolumn{4}{l}{\ \ \ \ \textit{Passage-level split}} \\
\quad Local & ~~7,901 ~~(68\%) &  80 $\pm$  14 & ~~8 $\pm$ 2 & ~~4 $\pm$ 2 \\
\quad Local (alternative) & ~~2,286 ~~(20\%) & 23 $\pm$ ~~5 & ~~9 $\pm$ 3 & ~~6 $\pm$ 4 \\
\quad Entity & ~~1,373 ~~(12\%) & 14 $\pm$ ~~6 &  13 $\pm$ 3 &  24 $\pm$ 8 \\
% \midrule
% \multicolumn{4}{l}{\textit{Summary- \& Passage-level sets}} \\
% \quad Total & 25,221~~~~~~~~~~~ & 55 $\pm$ 34 & ~~8 $\pm$ 3 & ~~5 $\pm$ 4 \\
\bottomrule
\end{tabular*}
\caption{
    QA item distribution and length statistics by question type. 
    Entity Questions feature notably longer answers (average 24 words vs. 4–6 words for other types), while Local Questions dominate the Passage-level set (68\%). 
    Question and answer lengths are measured in words; percentages represent proportions within each set.}
\label{tab:qa_type_stats}
\end{table}

% \begin{table}[t]
% \centering
% \begin{tabular}{lrrrr}
% \toprule
% \textbf{Question Word} & \textbf{Summary} & \textbf{Local} & \textbf{Local (Alt.)} & \textbf{Entity} \\
% \midrule
% \textit{Cosa} (What)     & 4,363 & 1,335 & 694 & 222 \\
% \textit{Chi} (Who)       & 3,513 & 2,251 & 248 & --- \\
% \textit{Quale} (Which)   & 2,537 & 2,203 & 709 & 799 \\
% \textit{Come} (How)      & 1,441 &   724 & 253 & 348 \\
% \textit{Dove} (Where)    & 1,099 &   838 &  68 & --- \\
% \textit{Perché} (Why)    &   365 &   234 &  41 & --- \\
% \textit{Quanto} (How much) & 142 &   136 & 161 & --- \\
% \textit{Quando} (When)   &    29 &    35 &  25 & --- \\
% \bottomrule
% \end{tabular}
% \caption{Distribution of question-initial words across question types (after stopword removal). English translations provided in parentheses.}
% \label{tab:question_words}
% \end{table}

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{document_length_distribution.pdf}
    \caption{
        Document Length distribution across the two splits of INDAQA2.
        While the distribution of the original dataset is skewed towards shorter documents, the new data split is more balanced.
    }
    \label{fig:doc_len}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{first_word_distribution_all.pdf}
    \caption{Distribution of first words of the questions. Each set has a different percentage of questions starting with a given first word.}
    \label{fig:first_words}
\end{figure}



\section{Metrics}\label{sec:evaluation}
We employ different evaluation methodologies for the two settings in INDAQA2: \textit{n}-gram-based metrics for generative questions and accuracy for multiple-choice questions.
In practice, as with all CALAMITA challenges, evaluation is carried out using the LM Evaluation Harness framework developed by EleutherAI.\footnote{\url{https://github.com/EleutherAI/lm-evaluation-harness}}

\paragraph{OEQA Evaluation}
For the generative task of INDAQA2, we employ \textbf{Exact-Match} (EM) and \textbf{METEOR}~\citep{banerjee-lavie-2005-meteor,lavie-agarwal-2007-meteor} as automatic evaluation metrics.
For both metrics, we take the maximum score between the candidate answer and the set of reference answers as the score of the model for that QA item.
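In symbols (the notation is ours and is introduced only for illustration), let $\hat{a}$ be the answer generated for a QA item, $R = \{r_1, \dots, r_k\}$ its set of reference answers, and $\mu$ either EM or METEOR. The item-level score is
\[
s(\hat{a}, R) = \max_{r \in R} \mu(\hat{a}, r),
\]
and, assuming the usual convention, the figure reported for a set is the mean of $s$ over all its QA items. For instance, if a candidate scores $0.20$ against one reference and $0.75$ against another, the item receives $0.75$.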

EM is a simple and intuitive measure that checks whether a reference answer appears verbatim within the generated output. However, this metric may fail to capture semantically correct responses that do not exhibit lexical overlap with the reference. To address these well-known limitations of EM, we additionally adopt METEOR. 

METEOR computes an alignment-based score between generated and reference answers by considering exact, stem, synonym, and paraphrase matches. Unlike simpler \textit{n}-gram-based metrics such as ROUGE, METEOR jointly models precision and recall while leveraging linguistic knowledge (e.g., synonyms) from WordNet, making it particularly suitable for evaluating natural language answers with different but equivalent surface forms.
Although there is no standard metric for evaluating open-ended QA tasks, we follow prior work reporting METEOR's superior reliability in this domain~\citep{bonomo-etal-2025-literaryqa}.
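For reference, we sketch the METEOR score in the parameterized form of~\citet{lavie-agarwal-2007-meteor}; the symbols below are our own notation, and the parameter values used by a given implementation may differ. Let $m$ be the number of aligned unigrams between a candidate of length $|c|$ and a reference of length $|r|$, and let $ch$ be the number of contiguous chunks in the alignment. Then
\[
P = \frac{m}{|c|}, \qquad R = \frac{m}{|r|}, \qquad F_{\mathrm{mean}} = \frac{P R}{\alpha P + (1-\alpha) R},
\]
\[
\mathrm{Pen} = \gamma \left(\frac{ch}{m}\right)^{\beta}, \qquad \mathrm{METEOR} = (1 - \mathrm{Pen}) \, F_{\mathrm{mean}}.
\]
As a worked example, assume the values of the original formulation~\citep{banerjee-lavie-2005-meteor}, namely $\alpha = 0.9$, $\beta = 3$, and $\gamma = 0.5$. A three-token candidate whose first two tokens exactly match a two-token reference gives $m = 2$, $ch = 1$, $P = 2/3$, and $R = 1$, hence $F_{\mathrm{mean}} = (2/3)/0.7 \approx 0.952$, $\mathrm{Pen} = 0.5 \cdot (1/2)^3 = 0.0625$, and a score of approximately $0.89$. Under the same values, even a perfect two-token match scores $1 - 0.0625 = 0.9375$, so very short exact answers stay slightly below $1$.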

\paragraph{MCQA Evaluation}
For the MCQA component of INDAQA2, we follow best practices for multiple-choice evaluation proposed in recent works, in which model outputs are evaluated using manually curated regular expressions \cite{molfese-etal-2025-right,wang-etal-2024-answer-c}. 
The set of regex patterns is constructed by inspecting the outputs of a subset of models and identifying the most frequent surface forms used to indicate the selected option. 
This approach follows the analysis and setup of \citet{molfese-etal-2025-right}, who show that regex-based answer extraction is more reliable than perplexity-based evaluation pipelines that rely on next-token probabilities over candidate choices. 
Accuracy is computed as the proportion of questions for which an extracted pattern corresponds to the correct answer. 
The complete list of regex patterns is provided in Appendix~\ref{appendix:regexes}.
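Formally, under our own illustrative notation, let $o_i$ be the model output for question $i$, $y_i \in \{\mathrm{A}, \mathrm{B}, \mathrm{C}, \mathrm{D}\}$ the gold label, and $g(o_i)$ the label extracted by the regex patterns, with $g(o_i) = \bot$ when no pattern matches. Accuracy over $N$ questions is then
\[
\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} [\, g(o_i) = y_i \,],
\]
where $[\cdot]$ equals $1$ if the condition holds and $0$ otherwise. Under this reading, which we state as an assumption, an output from which no label can be extracted counts as incorrect.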

\paragraph{Context Truncation} INDAQA2 contains documents of highly variable length, reaching up to hundreds of thousands of words (see Figure~\ref{fig:doc_len}), which makes the benchmark both memory- and computation-intensive. 
To keep the evaluation affordable, we define multiple context-size settings by truncating each book to its first $k$ words\footnote{We simply split on whitespace.}. 
Specifically, besides evaluating on the full book, we also test the models under two context lengths: 10K and 50K words. 
This setup allows us to assess system performance under partial but manageable contexts, and it also exposes differences in tokenizer efficiency, since the same number of words maps to a different number of tokens for each model. 
Models whose tokenizers require fewer splits per word (i.e., have a lower fertility) can fit more of the partial or full context within their context window.
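The truncation and the notion of fertility can be made explicit as follows (our notation, given as a sketch). Let a document $d$ consist of the whitespace-separated words $w_1, \dots, w_n$. For a budget of $k$ words, the truncated context is
\[
\mathrm{trunc}_k(d) = w_1 \, w_2 \cdots w_{\min(k, n)}, \qquad k \in \{10\mathrm{K},\, 50\mathrm{K},\, n\},
\]
where $k = n$ corresponds to the full book. Taking the fertility of a tokenizer $T$ on $d$ to be the average number of tokens per word, $\phi_T(d) = |T(d)| / n$, a $k$-word context occupies roughly $\phi_T(d) \cdot \min(k, n)$ tokens, so a lower fertility leaves more room for context within a fixed token window.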

\section{Results, Challenge-Specific Insights and Lessons Learned}

% \textcolor{red}{\textbf{TODO}}
% Following the model adopted in \url{https://arxiv.org/pdf/2512.04759}, describe and critically discuss here the results for your Challenge shared by the evaluation team, according to the following structure.

% Refer to the CALAMITA leaderboard for the official results (for all the accepted challenges, results will be published here): \url{https://github.com/CALAMITA-AILC/calamita-eval/blob/main/results/README.md}

We analyze and report the results of the INDAQA2 challenge, examining the performance of LLMs featured on the CALAMITA leaderboard. Our analysis is complemented by a qualitative discussion of a sample of generated outputs, as well as a critical assessment of current limitations and future challenges in Italian narrative understanding.

\paragraph{Performance across models (types, sizes)}
%non-native, fine-tuned native (ANITA), natively native (Minerva)
\Cref{tab:results_1K} reports the performance of the five instruction-tuned models evaluated in the INDAQA2 challenge. Specifically, we consider two Llama models—Llama3.1-8B and Llama3.1-70B \cite{grattafiori2024llama3herdmodels}—the ANITA-8B model \cite{polignano2024advancednaturalbasedinteractionitalian}, and two variants of Minerva: Minerva-7B \cite{orlando-etal-2024-minerva} and Minerva-7B$_L$ \cite{moroni-etal-2025-learned}.
Minerva-7B$_L$ is an extension of Minerva-7B that underwent continual training to increase the maximum context length from 4K to 32K tokens. By comparison, ANITA-8B natively supports an 8K token context window, while both Llama models handle input sequences of up to 128K tokens.
% For the 10-K word context, Llama 3.1-70B achieved the best overall results across all sets. However, Minerva-7B-Long performed better on specific metrics, namely EM for the summary set and METEOR for the entity set. Minerva-7B-Long also remained competitive, particularly on METEOR for the summary set and the local set. Regarding the 50K-word context, Llama 3.1-70B outperforms all other models across every set and metric, except for the summary set on EM, where Llama 3.1-8B achieved the best result.

Among the 7-8B models, performance patterns vary considerably across context sizes and task types.
In the 10K-word context setting, Llama3.1-8B achieves the strongest results on the \textit{Local} and \textit{Local alternative} multiple-choice tasks, reaching 43.83\% and 43.15\% accuracy respectively.  
Interestingly, the pattern partly reverses for the more challenging open-ended generation tasks: Minerva-7B$_L$ achieves the best METEOR among the 7-8B models on the \textit{Summary}, \textit{Local}, and \textit{Entity} sets, as well as the best EM on the \textit{Summary} set, while Llama3.1-8B retains the best EM on the \textit{Local} set and leads on both metrics for \textit{Local alternative}.
This suggests that Italian-specialized models may retain an advantage for generative tasks requiring nuanced language production, even when multilingual models are better at discriminative multiple-choice selection. Notably, in terms of METEOR on these three sets, Minerva-7B$_L$ surpasses both Llama3.1-8B and the other Italian-specific alternatives (ANITA-8B and the previous version, Minerva-7B). 

When the context is extended to 50K words, Llama3.1-8B and Llama3.1-70B show large improvements on the \textit{Local} and \textit{Local alternative} sets, owing to their ability to handle longer contexts.
This is especially evident in the MC task, where accuracy increases by more than 20 percentage points for both models. 
In contrast, the corresponding METEOR scores for the open-ended \textit{Local} set increase by only about 12 points for Llama3.1-8B and about 19 points for Llama3.1-70B.
This pattern suggests that Llama models are particularly adept at leveraging extended context for multiple-choice tasks, where the presence of answer options provides additional grounding that helps the model locate and select the correct response.
For open-ended generation, however, the benefits of longer context are more modest: while the model may identify relevant information, accurately producing well-formed Italian answers remains challenging.
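These increments can be read off~\Cref{tab:results_1K}. For Llama3.1-8B on the \textit{Local} set, moving from the 10K-word to the 50K-word setting gives
\[
\Delta_{\mathrm{ACC}} = 64.98 - 43.83 = 21.15, \qquad \Delta_{\mathrm{METEOR}} = 30.60 - 18.66 = 11.94,
\]
while for Llama3.1-70B the corresponding differences are $78.80 - 49.57 = 29.23$ and $41.75 - 23.24 = 18.51$.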
% We focus on METEOR rather than EM for this comparison, as EM is overly sensitive to minor lexical variations and does not reliably reflect improvements in answer quality.

Italian-specialized models show minimal improvement in the 50K-word and full-book settings across both task formats, with the ANITA and Minerva variants performing almost identically to their 10K-word results. We attribute this stagnation to their shorter context windows compared with those of the Llama models.
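A back-of-the-envelope calculation illustrates this. Assuming that a tokenizer emits at least one token per whitespace-separated word (an assumption, though one that typically holds for subword tokenizers), a $k$-word context requires at least $k$ tokens. Reading 4K, 8K, and 32K as 4,096, 8,192, and 32,768 tokens, we have
\[
10{,}000 > 8{,}192 > 4{,}096 \qquad \mbox{and} \qquad 50{,}000 > 32{,}768,
\]
so the 10K-word setting already exceeds the windows of Minerva-7B and ANITA-8B, and the 50K-word setting exceeds that of Minerva-7B$_L$. By contrast, a 128K-token window accommodates 50K words as long as the tokenizer produces no more than about $128\mathrm{K}/50\mathrm{K} \approx 2.6$ tokens per word, before accounting for the prompt and the question.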

With full book context available, Llama3.1-8B maintains its strong performance while showing only incremental gains over the 50K setting.
This plateau suggests that 50K words may already encompass most information relevant to answering the benchmark questions, with additional context providing diminishing returns.

As expected, Llama3.1-70B outperforms all smaller models on most metrics and context settings, with the notable exceptions of EM on the \textit{Summary} set and, in the 10K-word setting, METEOR on the \textit{Entity} set; its gains become more pronounced as context length increases.
The model's higher parameter count likely enables it to better exploit the additional information provided by longer contexts.
Due to computational constraints, we were unable to evaluate Llama3.1-70B on the full book setting.

\paragraph{Error Analysis}
% We analyse a set of selected qualitative examples to better understand the failure modes of the evaluated models across different evaluation settings. Table~\ref{tab:qualitative_analysis} reports the outputs of the tested models. The results show that the METEOR score is able to effectively capture answer quality: Minerva-7B$_{L}$, which produces a more appropriate response thanks to its extended context, achieves a higher METEOR score, whereas ANITA and Llama-8B obtain lower scores.
% Table~\ref{tab:qualitative_analysis_mc} presents a qualitative comparison in which Minerva-7B$_{L}$ is prompted with the same question from the \textit{Local} set under both OE and MC settings. 
% While the model successfully answers the question in the OE setting, it fails in the MC setting. 
% This behaviour is consistent with the quantitative results over the full dataset, where Minerva-7B${L}$ achieves strong performance in the OE setting but struggles in the MC setting. 
% These findings align with those of \citet{moroni-etal-2025-learned}, suggesting that OE and MC evaluations probe different aspects of generative models.

To further investigate the task-specific performance patterns observed in the quantitative results, we conduct a qualitative error analysis on selected examples from both evaluation settings.
\Cref{tab:qualitative_analysis} illustrates model outputs for a question from the \textit{Local} set in the OE task.
In this example, METEOR scores track answer quality: Minerva-7B$_{L}$ produces a correct response and obtains the highest METEOR score.
In contrast, ANITA and Llama-8B generate inaccurate answers, which is reflected in their lower scores.
This example is consistent with our earlier finding that Italian-specialized models maintain advantages in open-ended generation tasks when they can effectively access relevant context.

\Cref{tab:qualitative_analysis_mc} presents a particularly revealing case where Minerva-7B$_{L}$ is prompted with the same question under both OE and MC formats.
While the model successfully generates the correct answer in the OE setting, it selects an incorrect option in the MC setting.
This failure mode is consistent with the quantitative patterns discussed above: Minerva-7B$_{L}$ demonstrates good performance on open-ended tasks but struggles with multiple-choice selection.
The contrast highlights a fundamental difference in how these evaluation formats probe model capabilities, a finding that aligns with the observations of \citet{moroni-etal-2025-learned}.
While multiple-choice tasks primarily assess whether models can leverage answer options to discriminate between alternatives (a capability where Llama models excel), OE tasks require models to independently generate well-formed responses in the target language, where Italian-specialized models show relative strength.

% anita - OE
% {"doc_id": "01_al_di_la", "doc_question": "Dove stava andando la marchesa?", "doc_answers": ["A Bologna", "Verso la citt\u00e0 di Bologna"], "exact_match": 0, "meteor": 0.31779661016949157, "resps": [["Non andava da nessuna parte, stava seduta in un salotto di Bologna."]]}

% Llama - OE
% {"doc_id": "01_al_di_la", "doc_question": "Dove stava andando la marchesa?", "doc_answers": ["A Bologna", "Verso la citt\u00e0 di Bologna"], "exact_match": 0, "meteor": 0.13333333333333336, "resps": [["La marchesa stava andando a pranzo da Giorgio, il conte De Vinci, e poi avrebbe accompagnato Mimy, la moglie di Carlo, a pranzo da lei."]]}


% Minerva - OE
% {"doc_id": "01_al_di_la", "doc_question": "Dove stava andando la marchesa?", "doc_answers": ["A Bologna", "Verso la citt\u00e0 di Bologna"], "exact_match": 1, "meteor": 0.75, "resps": [[" La marchesa stava andando a Bologna."]]}

% -----

% Minerva - OE 
% {"doc_id": "02_alle_porte_d_italia", "doc_question": "Da quale citt\u00e0 provenivano le commissioni per estirpare l'eresia valdese?", "doc_answers": ["Da Torino", "Dalla citt\u00e0 di Torino"], "exact_match": 1, "meteor": 0.8928571428571429, "resps": [[" Da Torino."]]}

% Minerva - MC
% {"doc_id": "02_alle_porte_d_italia", "doc_question": "Da quale citt\u00e0 provenivano le commissioni per estirpare l'eresia valdese?", "doc_choices": ["A. Da Carignano", "B. Da Ginevra", "C. Dalla Francia", "D. Da Torino"], "doc_target": "D", "accuracy": 0, "resps": [[" B. Da Ginevra"]]}



\begin{table}[t]
\centering
\begin{tabular}{
p{0.3\textwidth} 
p{0.50\textwidth} 
c
}
\toprule
\textbf{QA item} & \textbf{Model answers} & \textbf{METEOR} \\
\midrule
\parbox[t]{0.3\textwidth}{
\textbf{Question}\\
Dove stava andando la marchesa?
}
& \textit{ANITA-8B}: Non andava da nessuna parte, stava seduta in un salotto di Bologna. & 0.31 \\
\cmidrule(lr){2-3}
\parbox[t]{0.3\textwidth}{
\textbf{References}\\
-- A Bologna\\
-- Verso la città di Bologna
}
& \textit{Llama-8B}: La marchesa stava andando a pranzo da Giorgio, il conte De Vinci, e poi avrebbe accompagnato Mimy, la moglie di Carlo, a pranzo da lei. & 0.13 \\
\cmidrule(lr){2-3}
& \textit{Minerva-7B\textsubscript{L}}: La marchesa stava andando a Bologna. & 0.75 \\
\bottomrule
\end{tabular}
\caption{Model responses on a Local question in the OE task, paired with their METEOR score.}
\label{tab:qualitative_analysis}
\end{table}



\begin{table}[t]
\centering
\begin{tabular}{
p{0.45\textwidth}
p{0.25\textwidth}
p{0.2\textwidth}
}
\toprule
\textbf{QA item} & \textbf{Model answers} & \textbf{Score} \\
\midrule
\parbox[t]{0.4\textwidth}{
\textbf{Question}\\
Da quale città provenivano le commissioni per estirpare l'eresia valdese?\\[0.5ex]
\textbf{Choices}\\
A. Da Carignano\\
B. Da Ginevra\\
C. Dalla Francia\\
\textbf{D. Da Torino}\\
}
&
\parbox[t]{0.2\textwidth}{
\textit{OE setting}\\
Da Torino\\[1ex]
\textit{MC setting}\\
B. Da Ginevra
}
&
\parbox[t]{0.2\textwidth}{
METEOR: 0.89\\\\[1ex]
Correct: False
}
\\
\bottomrule
\end{tabular}
\caption{Comparison of the Minerva-7B\textsubscript{L} responses on a Local question in the OE task and MC task, paired with their evaluation.
In the OE task, the model correctly responds, while in the MC task, it is confounded by the four options and chooses the wrong one.
The choice in \textbf{bold} is the correct response, which corresponds to one of the reference answers in the OE task.}
\label{tab:qualitative_analysis_mc}
\end{table}



\paragraph{Critical analysis/discussion/future}

% expected vs unexpected outcomes
Our results carry important implications for the development of Italian language models and long-context evaluation more broadly.
Current Italian-specialized models struggle with extended context utilization, suggesting that architectural innovations beyond continued pretraining may be necessary.
Future work should explore whether techniques such as modified attention mechanisms, retrieval-augmented approaches, or different positional encoding schemes can better enable smaller Italian models to leverage long contexts.
Moreover, the disparity between MC and OE performance raises questions about evaluation methodology: while both formats provide valuable signals, developing evaluation frameworks that better disentangle retrieval capabilities from generation quality could yield more actionable insights for model development.

\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{lccccccccc}
\toprule
\multirow{2}{*}{\quad\textbf{Model}} 
& 
\multicolumn{2}{c}{\textbf{Summary set}} & 
\multicolumn{3}{c}{\textbf{Local set}} & 
\multicolumn{3}{c}{\textbf{Local alt. set}} & 
%\multicolumn{2}{c}{\textbf{Entity set}} 
\textbf{Entity set}
\\
\cmidrule(r){2-3}
\cmidrule(r){4-6}
\cmidrule(r){7-9}
\cmidrule(l){10-10}
& EM & METEOR & EM & METEOR & ACC & EM & METEOR & ACC & METEOR \\ 
\midrule
% \multicolumn{10}{l}{\textit{1K words context}}\\
% \midrule
% $\quad\text{Llama3.1-8B}$ &  &  &  &  &  &  &  &  &    \\
% $\quad\text{Llama3.1-70B}$ & & & & & & & & &  \\
% $\quad\text{ANITA}$ & & & & & & & & &  \\
% $\quad\text{Minerva-7B}$ & & & & & &  & & & \\
% $\quad\text{Minerva-7B}_{L}$ & 4.18 & 16.66 & 6.20 & 13.49 & 27.64 & 3.06 & 10.45 & 26.30 & 21.07 \\
% \midrule
\multicolumn{10}{l}{\textit{10K-word context}}\\
\midrule
$\quad\text{ANITA-8B}$          & 3.92          & 23.12          & 10.38           & 14.92          & 38.08           & ~~4.16          &         11.47  & 36.37  & 20.15 \\
$\quad\text{Minerva-7B}$     & 3.24          & 25.37          & ~~8.15          & 16.98          & 28.57           & ~~2.89          &         12.56  & 25.08  & 21.56 \\
$\quad\text{Minerva-7B}_{L}$ & \textbf{5.08} & \textbf{26.93} & 11.87           & \textbf{20.31} & 33.52           & ~~6.08          &         15.31  & 27.48  & \textbf{22.21} \\
$\quad\text{Llama3.1-8B}$    & 4.46          & 25.60          & \textbf{14.44}  & 18.66          & \textbf{43.83}  & ~~\textbf{9.89} & \textbf{16.65} & \textbf{43.15}  & 20.83 \\
\midrule
$\quad\text{Llama3.1-70B}$   & 3.41 & \underline{27.62} & \underline{17.04}  & \underline{23.24} & \underline{49.57} & \underline{11.72}  & \underline{19.71} & \underline{50.02} & 20.01 \\
\midrule
\multicolumn{10}{l}{\textit{50K-word context}}\\
\midrule
$\quad\text{ANITA-8B}$       &         4.26  &         24.06  &          11.37 &         15.36 &          39.46 &         ~~4.86  & 11.79   & 37.64   &         20.57 \\
$\quad\text{Minerva-7B}$  &         3.45  &         25.81  &         ~~8.37 &         17.02 &           29.91 &         ~~2.97  & 12.64    & 25.95    &         21.46 \\
$\quad\text{Minerva-7B}_{L}$  &     5.30  &         27.80  &          13.10 &         21.05 &          33.91 &         ~~5.99  & 15.07   & 26.43   &         22.17 \\
$\quad\text{Llama3.1-8B}$ & \textbf{5.45} & \textbf{28.42} &  \textbf{29.44} & \textbf{30.60} & \textbf{64.98} &  \textbf{26.20} & \textbf{34.16} & \textbf{66.91} & \textbf{22.87} \\
\midrule
$\quad\text{Llama3.1-70B}$ & 3.21 & \underline{29.38} & \underline{36.22} & \underline{41.75} & \underline{78.80} & \underline{32.94} & \underline{45.23} & \underline{80.17} & \underline{24.35} \\
\midrule
\multicolumn{10}{l}{\textit{Full book context}}\\
\midrule
$\quad\text{ANITA-8B}$ &              ~~4.21  &         25.38  &         10.99  &         15.51  &         39.50  & ~~4.72                  & 12.04  & 37.68          & 19.55 \\
$\quad\text{Minerva-7B}$ &         ~~3.61  &         26.28  &         ~~7.68 &         16.71  &         30.42  & ~~2.62                  & 12.34  & 26.12          & 21.31 \\
$\quad\text{Minerva-7B}_{L}$ &     ~~5.42  &         27.94  &         12.94  &         20.75  &         33.54  & ~~5.90                  & 15.11  & 26.47          & 22.10 \\ 
$\quad\text{Llama3.1-8B}$ &~~\textbf{5.51} & \textbf{30.74} & \textbf{31.73} & \textbf{34.24} & \textbf{67.77} &  \textbf{28.39} & \textbf{37.87} & \textbf{70.15} & \textbf{22.77} \\
\midrule
$\quad\text{Llama3.1-70B}$ & - & - & - & - & - & - & - & - & - \\
\bottomrule
\end{tabular*}
\caption{Results of the five evaluated models on INDAQA2. We highlight in \textbf{bold} the best result per context setting among the 7-8B models, and we \underline{underline} the best result when the 70B model is also considered.
Because reference answers in the Entity set are long, we do not report EM scores for this set, as they are close to zero.}
\label{tab:results_1K}
\end{table}


\section{Limitations}\label{sec:limitations}
\paragraph{Historical and Linguistic Context}
To remain free of copyright restrictions, INDAQA2 contains Italian literary texts spanning the period 1827--1948. 
As such, the language employed in these texts reflects the lexical and grammatical conventions of 19th- and early 20th-century Italian, including archaic vocabulary and syntactic structures that may differ substantially from contemporary usage. 
Additionally, the vast majority of authors in the corpus are male, which mirrors the gender disparities of that period. 
Consequently, the narrative content may embody attitudes, ideologies, and perspectives that reflect the sociocultural aspects of the period, some of which diverge from contemporary ethical and social standards.

\paragraph{Synthetic Data}
Although the QA items underwent quality control procedures, it is important to note that they were generated using an LLM. 
As such, the dataset may exhibit certain limitations inherent to LLM-generated content, including potential factual inaccuracies and systematic biases that reflect the model's training data and architectural characteristics.
Our annotation process (\Cref{sec:annotation}) revealed that while questions were consistently well-posed and answerable, approximately 4.74\% of items contained errors, primarily in secondary reference answers.
Since our evaluation setup (\Cref{sec:evaluation}) computes the maximum of EM or METEOR scores across all references, the presence of at least one correct reference ensures valid answers are properly credited.
However, inaccurate secondary references could theoretically reward incorrect model responses that happen to align with those errors rather than the correct answer.
Given the low error rate and the fact that such biases manifest \textit{only} when a model's output matches the wrong reference instead of the correct one, we expect the practical impact on benchmark results to be limited.
Nevertheless, future work could involve manual correction of identified errors to further enhance dataset quality.

\paragraph{Evaluation Setup}
While METEOR provides a reasonable automated evaluation framework for our task, we acknowledge its limitations. Automatic metrics based on surface form similarity may not fully capture semantic correctness, especially for questions requiring reasoning or inference. An LLM-as-a-Judge framework, where a powerful language model evaluates answer quality according to rubrics, has been shown to yield scores more aligned with human judgments in recent work~\citep{bonomo-etal-2025-literaryqa,zheng2023judgingllmasajudgemtbenchchatbot}. However, such approaches introduce additional computational costs, potential biases from the judge model, and complications in reproducibility. For the purposes of establishing baseline performance and enabling rapid iteration, we consider METEOR a pragmatic choice that balances evaluation quality with practical constraints.


\section{Ethical issues}
Given the publication years of the texts in the dataset (1827--1948), the source texts reflect the historical and sociocultural context of their time and may contain biased, stereotypical, or otherwise sensitive portrayals related to gender, ethnicity, religion, violence, or other forms of toxicity.
While these aspects are intrinsic to the literary material and are preserved for research fidelity, their uncritical use may lead to the reproduction or amplification of outdated viewpoints in downstream applications.

We do not attempt to modify or filter such content, and we encourage users to interpret the dataset within its historical context and to refrain from training models on it or deploying models trained on it.
The dataset is intended \textbf{exclusively} for research and benchmarking purposes.

\section{Data licence and copyright issues}
The dataset is constructed from publicly available data sources.
The synthetic QA items are not subject to copyright, as they comply with the provider's usage policy\footnote{\url{https://policies.google.com/terms/generative-ai/use-policy}}.
The dataset is released for research use through the repository indicated in~\Cref{sec:data_format}.
All original copyrights remain with the respective content owners, and the dataset does not redistribute proprietary or restricted material.
Users of the dataset are responsible for ensuring compliance with the licences of the original sources when using the data.


\begin{acknowledgments}
% Identification of funding sources and other support, and thanks to individuals and groups that assisted in the research and the preparation of the work should be included in an acknowledgment section, which is placed just before the reference section in your document.
Luca Gioffré, Luca Moroni, and Alberte Fernández-Castro gratefully acknowledge the support of the AI Factory IT4LIA project and of CINECA for access to high-performance computing facilities.
Elena Marafatto gratefully acknowledges the support of Agenzia per la Cybersicurezza Nazionale.
Roberto Navigli acknowledges the support of PNRR MUR project \texttt{PE0000013-FAIR}.
\end{acknowledgments}

%% The declaration on generative AI comes in effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}
During the preparation of this work, the authors used ChatGPT and Claude in order to: Grammar and spelling check, Formatting assistance, Improve writing style. 
After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content. 



%%
%% Define the bibliography file to be used
\bibliography{anthology-1,anthology-2,custom}

%%
%% If your work has an appendix, this is the place to put it.
\appendix\label{appendix}

\section{Prompts used}\label{appendix:prompts}
%In this Section, we present the prompts used during the generation of the QA items.
%Most of our items were generated with the prompt in~\Cref{fig:generation_prompt_local-level}, with the exception of \textit{Entity} QA items, for which we used the prompt in~\Cref{fig:entity_question_prompt}.
To ensure continuity with the QA items in the Summary-level split, we reuse the prompt of~\citep{moroni-etal-2025-learned}, with minor changes, for \textit{Local} and \textit{Local (alternative)} items. 
For the Summary-level data split (the original INDAQA), the \textbf{summary} is provided as context and the model is asked to produce 20 items. 
For the Passage-level data split, instead, we provide a \textbf{single passage} and ask the model to produce 3 QA items in the \textit{Local} setting and 1 in the \textit{Local (alternative)} setting (\Cref{fig:generation_prompt_local-level,fig:generation_prompt_local-level-alt}, respectively). 
For \textit{Entity} QA items, we devise a new prompt (\Cref{fig:entity_question_prompt}).

We also report the prompts used for the multiple-choice conversion (\Cref{fig:mc_conversion_prompt}) and for the QA refinement step (\Cref{fig:correction_prompt}).

%%%%%% PROMPTS %%%%%%%
%Zero-shot inference prompt$
\begin{figure}[h!]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  \centering
  \begin{minipage}{0.93\textwidth}
  \ttfamily
    \textbf{User prompt}\\
    \{text\}\\
    Domanda: \{question\}\\
    \{choices\_block\}\\
    Risposta:
  \end{minipage}
    }
}
\end{center}
\caption{
    Prompt used for inference. 
    In the generative task, the \texttt{choices\_block} parameter is empty.
}
\label{fig:inference_prompt}
\end{figure}
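
As a minimal sketch of how the template in \Cref{fig:inference_prompt} could be filled, the following Python function builds the prompt for both tasks. The function name \texttt{build\_inference\_prompt} and the choice of listing one option per line are assumptions made for illustration, not a description of the exact inference code.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
def build_inference_prompt(text, question, choices=None):
    """Fill the inference template shown in the figure.

    `choices` is None (or empty) in the generative task.
    """
    # Assumption: options are listed one per line, e.g. "A. Carolina".
    choices_block = "\n".join(choices) if choices else ""
    return f"{text}\nDomanda: {question}\n{choices_block}\nRisposta:"
\end{minted}

In the generative task the block is empty, so the question line is followed by a blank line and then by \texttt{Risposta:}.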

%% Local level %%
\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  %\centering
  \begin{minipage}{0.93\textwidth}
    \normalfont\textbf{System prompt}\\
    \ttfamily
    Sei un esperto di letteratura. Il tuo compito è quello di generare domande e risposte sulla trama di un testo letterario.\\
    \\
    \normalfont\textbf{User prompt}\\
    \ttfamily
    TESTO: \{text\}\\
    \\
    Scrivi almeno 3 domande diverse relative alla trama del testo. Per ogni domanda, scrivi due possibili risposte, entrambe corrette e complete.\\
    Le domande devono essere chiare e non ambigue. 
    Se il testo è breve, genera almeno 2 domande.\\
    Le risposte devono essere brevi e rispecchiare fedelmente il testo originale. 
    Le risposte possono anche essere quasi identiche.\\
    Segui questo formato senza commentare:\\
    \\
    Domanda: <domanda>\\
    Risposta A: <risposta>\\
    Risposta B: <risposta>\\
    \\
    \normalfont\textbf{Assistant prompt}\\
    \ttfamily
    Domanda: 
  \end{minipage}
    }
}
\end{center}
\caption{
    Prompt used to generate the \textit{Local} QA items.
}
\label{fig:generation_prompt_local-level}
\end{figure}

%% Local Alternative level %%
\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  %\centering
  \begin{minipage}{0.93\textwidth}
    \normalfont\textbf{System prompt}\\
    \ttfamily
    Sei un esperto di letteratura. Il tuo compito è quello di generare domande e risposte sulla trama di un testo letterario.\\
    \\
    \normalfont\textbf{User prompt}\\
    \ttfamily
    TESTO: \{text\}\\
    \\
    Ecco degli esempi di domande e risposte riguardo questo testo:\\
    \{question\_block\}\\
    \\
    Scrivi almeno un'altra domanda (diversa) relativa alla trama del testo. 
    Per ogni domanda, scrivi due possibili risposte, entrambe corrette e complete.\\
    Le domande devono essere chiare e non ambigue. 
    Se il testo è breve, genera comunque almeno una domanda.\\
    Le risposte devono essere brevi e rispecchiare fedelmente il testo originale. 
    Le risposte possono anche essere quasi identiche.\\
    Segui questo formato senza commentare:\\
    \\
    Domanda: <domanda>\\
    Risposta A: <risposta>\\
    Risposta B: <risposta>\\
    \\
    \normalfont\textbf{Assistant prompt}\\
    \ttfamily
    Domanda: 
  \end{minipage}
    }
}
\end{center}
\caption{
    Prompt used to generate the \textit{Local (alternative)} QA items.
}\label{fig:generation_prompt_local-level-alt}
\end{figure}

%% Entity level %%
\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  \centering
  \begin{minipage}{0.93\textwidth}
  \ttfamily
    \textbf{System prompt}\\
    Sei un esperto di letteratura. Dati degli estratti e delle domande e risposte riguardanti un'entità (es., un personaggio letterario, un luogo, etc.), scrivi delle nuove domande e risposte che siano più generali (meno specifiche riguardo a dettagli del libro) ma sempre corrette e coerenti.
    \\\\
    \textbf{User prompt}\\
    Linee guida per le domande e risposte:\\
    - Le domande devono essere chiare e pertinenti.\\
    - Le domande e le risposte devono essere formulate in italiano corretto.\\
    - Le domande, più che chiedere dettagli specifici, devono riguardare l'arco narrativo del personaggio/il ruolo dell'entità nella storia.\\
    ---\\
    \{context\_block\}\\
    ---\\
    Task:\\
    Scrivi due/tre nuove domande riguardo \{entity\}, ognuna con una risposta, che siano più generali ma sempre corrette e coerenti secondo le linee guida.\\
    Segui esattamente questo formato senza commentare:\\
    Domanda: <domanda>\\
    Risposta: <risposta>
    \\\\
    \textbf{Assistant prompt}\\
    Domanda:
  \end{minipage}
    }
}
\end{center}
\caption{
    Prompt used to generate Entity QA items. The model is fed a list of passages and already generated QA items about the \texttt{entity} through \texttt{context\_block}.
}\label{fig:entity_question_prompt}
\end{figure}

%% MC Conversion %%
\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  \centering
  \begin{minipage}{0.93\textwidth}
  \ttfamily
    \textbf{System prompt}\\
    Sei un esperto di letteratura. Dato un estratto, una domanda e le risposte giuste che lo riguardano, genera tre "distrattori" (e.g., risposte sbagliate) plausibili.\\
    \\
    \textbf{User prompt}\\
    Libro: \{title\}\\
    Estratto:\\
    \{context\}\\
    ---\\
    Domanda: \{question\}\\
    Risposta A: \{references[0]\}\\
    Risposta B: \{references[1]\}\\
    ---\\
    Task: scrvi tre risposte sbagliate ma plausibili che potrebbero confondere un lettore, basandoti sull'estratto fornito. Segui esattamente questo formato senza commentare:\\
    Distrattore X: <distrattore\_x>\\
    \\
    \textbf{Assistant prompt}\\
    Distrattore 1:
  \end{minipage}
    }
}
\end{center}
\caption{Prompt used to convert QA items to the multiple-choice format by generating three plausible distractors from the passage, the question, and its reference answers.}
\label{fig:mc_conversion_prompt}
\end{figure}

%% Correction prompt %%
\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  \centering
  \begin{minipage}{0.93\textwidth}
  \ttfamily
    \textbf{System prompt}\\
    Sei un esperto di letteratura. Dato un estratto, decidi se la domanda e le risposte che lo riguardano sono appropriate secondo le linee guida.\\
    \\
    \textbf{User prompt}\\
    Libro: \{title\}\\
    Estratto:\\
    \{context\}\\
    ---\\
    Domanda: \{question\}\\
    Risposta A: \{references[0]\}\\
    Risposta B: \{references[1]\}\\
    ---\\
    Linee guida per le domande e risposte:\\
    - La domanda deve contenere abbastanza riferimenti in modo da non essere ambigua: eventi e personaggi VANNO SPECIFICATI BENE con riferimenti temporali, col nome proprio o con caratteristiche descritte nel testo, **in modo da renderla chiara rispetto all'intero libro** (evita termini generici come 'un uomo', 'una donna', 'uno scontro', 'un incontro', etc.).\\
    - La domanda deve essere formulata in modo da richiedere una risposta specifica e non generica.\\
    - La domanda non deve contenere frasi come 'nell'estratto fornito', 'nel libro', 'secondo te', 'cosa pensi di', 'come ti sembra', etc.; deve essere diretta e neutrale.\\
    - Le risposte devono essere ENTRAMBE corrette e complete; possono cambiare solo nel modo in cui sono scritte (parafrasi).\\
    - Le risposte non devono essere contenute nella domanda stessa.\\
    - La domanda e le risposte devono essere formulate in italiano corretto.\\
    ---\\
    Task:\\
    Valuta se la domanda e le risposte sono appropriate secondo le linee guida. Se sono appropriate, rispondi semplicemente con 'OK'. Se non lo sono, fornisci una versione corretta della domanda e delle risposte seguendo le linee guida ed usando l'estratto fornito; riscrivi tutti gli elementi corretti.
  \end{minipage}
    }
}
\end{center}
\caption{Prompt used to correct the generated QA items.}
\label{fig:correction_prompt}
\end{figure}



\section{Metadata Extraction}\label{appendix:metadata}

We extract bibliographic metadata for each document in our corpus by querying an LLM, using the Wikipedia pages of the documents to ground the generation process. 
We do not use Wikipedia's structured metadata directly, as we found that for many pages it was missing or inconsistent with the information in the page text.  
This appendix describes our metadata extraction pipeline and quality assurance procedures.

\subsection{Extraction Pipeline}
For each document in the corpus, we use the opening paragraph from its corresponding Wikipedia page as the primary information source. This introductory text typically contains the essential bibliographic information in a standardized format. We employ an LLM\footnote{We found \texttt{meta-llama/Llama-3.1-8B-Instruct} capable enough for this simple task.} to extract three metadata fields through separate, targeted prompts:

\begin{itemize}
    \item \textbf{Publication year}: The year of first publication or composition
    \item \textbf{Author name}: The primary author or creator of the work
    \item \textbf{Title}: The canonical title of the work
    %\item \textbf{Genre}: Literary genres associated with the work
\end{itemize}

Each field is queried independently to maximize extraction accuracy and allow for field-specific error handling. 
We show an example for extracting the publication year in~\Cref{fig:year_prompt}. 
Similar structured prompts are used for author names and titles, with appropriate formatting constraints specified for each field type.

\begin{figure}[h]
\begin{center}
\noindent\fbox{\colorbox{eggshell}{%
  \centering
  \begin{minipage}{0.93\textwidth}
  \ttfamily
    \textbf{User prompt}\\
    Based on the following text from a Wikipedia page, extract only the publication year of the work. Respond with just the year as a four-digit number.\\
    \{wikipedia\_paragraph\}
    \\\\
    \textbf{Assistant prompt}\\
    Year:
  \end{minipage}
    }
}
\end{center}
\caption{
    Prompt used to extract the publication year from the opening paragraph of a Wikipedia page.
}
\label{fig:year_prompt}
\end{figure}
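
The per-field querying described above can be sketched as follows for the publication year. This is an illustrative sketch only: \texttt{generate} is a hypothetical callable that takes the user prompt and the assistant prefix and returns the model completion, and the regular expression used to parse the answer is an assumption.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
import re

YEAR_PROMPT = (
    "Based on the following text from a Wikipedia page, extract only the "
    "publication year of the work. Respond with just the year as a "
    "four-digit number.\n{wikipedia_paragraph}"
)

def extract_year(paragraph, generate):
    """Query the model for the year; return an int, or None if unparseable.

    `generate` is a hypothetical callable:
    (user_prompt, assistant_prefix) -> completion string.
    """
    prompt = YEAR_PROMPT.format(wikipedia_paragraph=paragraph)
    completion = generate(prompt, "Year:")
    # Assumption: the first standalone four-digit number is the answer.
    match = re.search(r"\b\d{4}\b", completion)
    return int(match.group()) if match else None
\end{minted}

Returning \texttt{None} for unparseable completions makes it easy to route those documents to manual verification.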



\subsection{Validation and Quality Control}
To ensure metadata quality, we cross-reference the information with Wikipedia data. 
When structured metadata fields are available directly from Wikipedia, we compare the LLM-extracted values against these canonical sources. Discrepancies trigger manual review.
When Wikipedia data are unavailable or ambiguous, we perform manual verification and correction. 
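
A minimal sketch of this cross-referencing step is given below. The function name \texttt{needs\_manual\_review} and the normalization (whitespace stripping and lowercasing) are assumptions for illustration; the actual comparison may differ.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
def needs_manual_review(extracted, wikipedia):
    """Return the fields whose LLM-extracted value must be checked by hand.

    `extracted` and `wikipedia` map field names to values; a field that
    is absent from `wikipedia` has no structured reference.
    """
    def norm(value):
        # Assumption: values are compared case-insensitively.
        return str(value).strip().lower()

    flagged = []
    for field, value in extracted.items():
        reference = wikipedia.get(field)
        # Missing reference or mismatch -> manual verification.
        if reference is None or norm(reference) != norm(value):
            flagged.append(field)
    return flagged
\end{minted}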



\section{Additional statistics}\label{appendix:stats}
We report the authorship distribution in~\Cref{tab:author_distribution}, divided by data split.
In the Summary-level split, a few authors dominate the corpus: \textit{Carlo Goldoni} alone accounts for over one-third of the books (35.2\%), followed by \textit{Luigi Pirandello} (23.8\%) and \textit{Emilio Salgari} (13.3\%). 
The remaining authors each contribute only a small fraction of the data, highlighting a strong skew toward a handful of prolific writers.
This imbalance largely reflects the uneven coverage of authors on Wikipedia, as only documents for which a summary was available were included in the dataset.

In the \textit{Passage-level} split, the distribution is more balanced: the top author, \textit{Enrico Castelnuovo}, represents 8.1\% of the books, and most other authors contribute between 2\% and 7\%. 
This indicates that the passage-level split is more diverse with respect to authorship, and less dominated by a few prolific figures, complementing the summary-level data.  

We also show the distribution of publication years for the two data splits in~\Cref{fig:year_distribution}.
The Summary-level split spans a wide temporal range, with a mean publication year around 1814 and a large standard deviation (206 years), reflecting the broad historical coverage of the original material (strong outliers are listed in~\Cref{tab:outliers}).
In contrast, the Passage-level distribution is more homogeneous, with a mean publication year around 1900, a small standard deviation (21 years), and no strong outliers.

Finally, \Cref{fig:annotation_set} shows how the books are divided into 20 equal-probability bins based on text-length quantiles, together with the book chosen from each bin.
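
The quantile-based binning can be sketched in Python as follows. This is a sketch under stated assumptions: the function name \texttt{pick\_one\_book\_per\_bin}, the unit of the text length, and the uniform random draw within each bin are hypothetical, since the selection criterion inside a bin is not specified here.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
import bisect
import random
import statistics

def pick_one_book_per_bin(lengths, n_bins=20, seed=0):
    """Map each bin index to one selected book ID.

    `lengths` maps book ID -> text length (the unit is an assumption).
    """
    # n_bins - 1 cut points split the data into equal-probability bins.
    cuts = statistics.quantiles(lengths.values(), n=n_bins)
    bins = {}
    for book_id, length in lengths.items():
        index = bisect.bisect_right(cuts, length)
        bins.setdefault(index, []).append(book_id)
    rng = random.Random(seed)
    # Assumption: one book is drawn uniformly at random from each bin.
    return {b: rng.choice(sorted(ids)) for b, ids in sorted(bins.items())}
\end{minted}

Binning on quantiles rather than on fixed-width length intervals keeps the number of books per bin roughly equal, so short and long texts are both represented in the selection.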

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{selected_book_histogram.pdf}
    \caption{Distribution of the books in the annotation set. We show the ID of the selected 20 books, one from each of the 20 equal-probability bins.}
    \label{fig:annotation_set}
\end{figure}


\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{@{\hspace{0.5cm}\extracolsep{\fill}}lrrlrr@{\hspace{0.5cm}}}
\toprule
\multicolumn{3}{c}{\textit{Summary-Level}} & \multicolumn{3}{c}{\textit{Passage-Level}} \\
\cmidrule(r){1-3} \cmidrule(l){4-6}
\textbf{Author} & \textbf{\# Books} & \textbf{\%\ \ } & \textbf{Author} & \textbf{\# Books} & \textbf{\%\ \ } \\
\midrule
Carlo Goldoni       & 127 & 35.2 & Enrico Castelnuovo & 8  & 8.1 \\
Luigi Pirandello    & 86  & 23.8 & Matilde Serao     & 7  & 7.1 \\
Emilio Salgari      & 48  & 13.3 & Antonio Beltramelli & 5 & 5.1 \\
Edgar Allan Poe     & 19  & 5.3  & Edmondo De Amicis  & 5 & 5.1 \\
Grazia Deledda      & 10  & 2.8  & Guido da Verona    & 5 & 5.1 \\
Giuseppe Giacosa    & 7   & 1.9  & Alfredo Panzini    & 5 & 5.1 \\
Pietro Metastasio   & 6   & 1.7  & Luigi Capuana      & 4 & 4.0 \\
Jules Verne         & 4   & 1.1  & Anton Giulio Barrili & 4 & 4.0 \\
William Shakespeare & 3   & 0.8  & Alfredo Oriani     & 3 & 3.0 \\
Antonio Fogazzaro   & 3   & 0.8  & Nicola Misasi      & 3 & 3.0 \\
Niccolò Machiavelli & 2   & 0.6  & Emilio De Marchi   & 3 & 3.0 \\
Charles Perrault    & 2   & 0.6  & Annie Vivanti      & 3 & 3.0 \\
Giovanni Verga      & 2   & 0.6  & F.T. Marinetti     & 3 & 3.0 \\
Paolo Mantegazza    & 2   & 0.6  & Salvatore Farina   & 2 & 2.0 \\
Matilde Serao       & 2   & 0.6  & Lucio D'Ambra      & 2 & 2.0 \\
Giovanni Pascoli    & 2   & 0.6  & Jack London        & 2 & 2.0 \\
Alessandro Verri    & 2   & 0.6  & Gerolamo Rovetta   & 2 & 2.0 \\
                    &     &      & Federico De Roberto & 2 & 2.0 \\
                    &     &      & E.A. Butti         & 2 & 2.0 \\
                    &     &      & Luigi Pirandello   & 2 & 2.0 \\
                    &     &      & Italo Svevo        & 2 & 2.0 \\
\bottomrule
\end{tabular*}
\caption{Authors with more than one book in the benchmark, divided by \textit{Summary-level} and \textit{Passage-level} splits, along with the number of books and relative percentages.}
\label{tab:author_distribution}
\end{table}


% ============================================================
% OUTLIER DETAILS
% ============================================================

% Year: -650
%   Author: Esiodo
%   Title: Lo scudo di Eracle

% Year: -30
%   Author: Quinto Orazio Flacco
%   Title: Satire

% Year: 150
%   Author: Luciano di Samosata
%   Title: La storia vera

% Year: 1298
%   Author: Marco Polo
%   Title: Il Milione

% Year: 1475
%   Author: Angelo Poliziano
%   Title: Stanze per la giostra

% Year: 1483
%   Author: Matteo Maria Boiardo
%   Title: Orlando innamorato

% Year: 1504
%   Author: Jacopo Sannazaro
%   Title: Arcadia

% Year: 1518
%   Author: Niccolò Machiavelli
%   Title: Mandragola

% Year: 1525
%   Author: Niccolò Machiavelli
%   Title: Clizia
\begin{figure}[h]
    \centering
    \includegraphics[width=1\linewidth]{publication_year_histogram.pdf}
    \caption{Publication year distribution of the books in the benchmark. Years have been extracted from the introductory paragraph of the related Wikipedia page (either automatically or manually).}
    \label{fig:year_distribution}
\end{figure}

\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{@{\hspace{0.5cm}\extracolsep{\fill}}llr@{\hspace{0.5cm}}}
\toprule
\textbf{Title} & \textbf{Author} & \textbf{Publication year} \\
\midrule
\textit{Lo scudo di Eracle}    & Esiodo               & \textasciitilde650~BCE\quad\quad\\
\textit{Satire}                & Quinto Orazio Flacco & \textasciitilde30~BCE\quad\quad\\
\textit{La storia vera}        & Luciano di Samosata  & \textasciitilde150~\phantom{B}CE\quad\quad\\
\textit{Il Milione}            & Marco Polo           & 1298~\phantom{B}CE\quad\quad\\
\textit{Stanze per la giostra} & Angelo Poliziano     & 1475~\phantom{B}CE\quad\quad\\
\textit{Orlando innamorato}    & Matteo Maria Boiardo & 1483~\phantom{B}CE\quad\quad\\
\textit{Arcadia}               & Jacopo Sannazaro     & 1504~\phantom{B}CE\quad\quad\\
\textit{Mandragola}            & Niccolò Machiavelli  & 1518~\phantom{B}CE\quad\quad\\
\textit{Clizia}                & Niccolò Machiavelli  & 1525~\phantom{B}CE\quad\quad\\
\bottomrule
\end{tabular*}
\caption{Books from the \textit{Summary-level} data split that are outliers with respect to the publication year distribution.}
\label{tab:outliers}
\end{table}


\section{Dataset examples}\label{appendix:examples}
In this Section we present four examples taken from INDAQA2, in order to clarify the structure of the dataset. 

\begin{figure}[h!]
\centering
\begin{minted}[  
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{json}
// summary_level
{
    "answers": [
        "In un villaggio della Foresta Nera.", 
        "Nella Foresta Nera, in un villaggio."
    ],
    "choices": [],                   // not available
    "entity": null,                  // not available
    "kind": "summary_question",
    "model": "gemini2-flash",
    "question": "Dove si svolge la festa di fidanzamento iniziale?",
    "question_id": "000_le_villi.summary.0",
    "source_paragraphs_ids": [],     // not available
    "source_questions_ids": [],      // not available
    "target": {                      
        "label": null,               // not available
        "text": null                 // not available
    }
}
\end{minted}
\caption{QA item from the summary-level data split. Please note the null/empty data fields for this combination.}
\label{fig:summary-level_sample}
\end{figure}

\begin{figure}[h]
\centering
\begin{minted}[  
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{json}
// passage_level - Local Question item 
{
    "answers": [
        "Giacometta Maldi", 
        "Giacometta"
    ],
    "choices": [
        "A. Carolina", 
        "B. Elena", 
        "C. Giacometta Maldi", 
        "D. Geltrude"
    ],
    "entity": null,                     // not available
    "kind": "local_question",
    "model": "gemini-2.5-flash",
    "question": "Come si chiama la giovane donna al centro delle attenzioni per il matrimonio?",
    "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-a.1",
    "source_paragraphs_ids": [0],
    "source_questions_ids": [],         // not available
    "target": {
        "label": "C", 
        "text": "Giacometta Maldi"
    }
}
\end{minted}
\caption{QA item from the Passage-level data split, Local Question set. Please note the null/empty data fields for this combination.}
\label{fig:Passage-level_sample_local}
\end{figure}

\begin{figure}[h]
\centering
\begin{minted}[  
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{json}
// passage_level - Alternative Local Question item
{
    "answers": [
        "Biondi", 
        "Erano biondi"
    ],
    "choices": [
        "A. Neri", 
        "B. Biondi", 
        "C. Castani", 
        "D. Rossi"
    ],
    "entity": null,                     // not available
    "kind": "local_question_alt",
    "model": "gemini-2.5-flash",
    "question": "Di che colore erano i capelli di Giacometta?",
    "question_id": "00_ahi_giacometta_la_tua_ghirlandella.set-b.1",
    "source_paragraphs_ids": [0],
    "source_questions_ids": [],         // not available
    "target": {
        "label": "B", 
        "text": "Biondi"
    }
}
\end{minted}
\caption{QA item from the Passage-level data split, Local Alternative Question set. Please note the null/empty data fields for this combination.}
\label{fig:Passage-level_sample_alt}
\end{figure}

\begin{figure}[h]
\centering
\begin{minted}[  
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{json}
// passage_level - Entity Questions
{
    "answers": ["La sua eccentricità e la tendenza a comportarsi in modo inappropriato o fuori luogo."],
    "choices": [],                      // not available
    "entity": "adalgisa",
    "kind": "entity_question",
    "model": "gemini-2.5-flash",
    "question": "Qual è una caratteristica distintiva del personaggio di Adalgisa?",
    "question_id": "00_ahi_giacometta_la_tua_ghirlandella.",
    "source_paragraphs_ids": [4, 8],
    "source_questions_ids": [0, 2, 4],
    "target": {                        
        "label": null,                  // not available 
        "text": null                    // not available
    } 
}
\end{minted}
\caption{QA item from the Passage-level data split, Entity Question set. Please note the null/empty data fields for this combination.}
\label{fig:Passage-level_sample_entity}
\end{figure}
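
To make the relation between the fields of a multiple-choice item explicit, the following Python sketch checks the item in \Cref{fig:Passage-level_sample_local}. The function name \texttt{check\_mc\_item} is hypothetical, and the two invariants it tests (the target label points to the target text among the choices, and the target text is one of the reference answers) are inferred from the examples shown here rather than guaranteed for the whole dataset.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
def check_mc_item(item):
    """Check that a multiple-choice item is internally consistent."""
    label = item["target"]["label"]
    text = item["target"]["text"]
    # Choices are formatted as "<label>. <text>", e.g. "C. Giacometta Maldi".
    options = dict(choice.split(". ", 1) for choice in item["choices"])
    # Assumed invariants, inferred from the examples in this appendix.
    return options.get(label) == text and text in item["answers"]

item = {
    "answers": ["Giacometta Maldi", "Giacometta"],
    "choices": ["A. Carolina", "B. Elena",
                "C. Giacometta Maldi", "D. Geltrude"],
    "target": {"label": "C", "text": "Giacometta Maldi"},
}
\end{minted}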

\section{Regexes}\label{appendix:regexes}
In \Cref{tab:regex}, we report the regular expressions used to evaluate models in the multiple-choice setting. The list of regexes was defined by inspecting the outputs of selected models and identifying the most common patterns used to introduce the chosen option. 

\begin{table}[t]
\centering
\begin{tabular*}{\textwidth}{rl}
\toprule
\textbf{\#} & \textbf{Regex} \\
\midrule
1  & \texttt{r"Risposta \textbackslash(?([ABCD])\textbackslash)?"} \\
2  & \texttt{r"risposta corretta è la \textbackslash(?([ABCD])\textbackslash)?"} \\
3  & \texttt{r"risposta corretta è: \textbackslash(?([ABCD])\textbackslash)?"} \\
4  & \texttt{r"risposta corretta è \textbackslash(?([ABCD])\textbackslash)?"} \\
5  & \texttt{r"risposta \textbackslash(?([ABCD])\textbackslash)?"} \\
6  & \texttt{r"Risposta: \textbackslash(?([ABCD])\textbackslash)?"} \\
7  & \texttt{r"risposta: \textbackslash(?([ABCD])\textbackslash)?"} \\
8  & \texttt{r"Risposta è \textbackslash(?([ABCD])\textbackslash)?"} \\
9  & \texttt{r"risposta è \textbackslash(?([ABCD])\textbackslash)?"} \\
10 & \texttt{r"Risposta è la \textbackslash(?([ABCD])\textbackslash)?"} \\
11 & \texttt{r"risposta è la \textbackslash(?([ABCD])\textbackslash)?"} \\
12 & \texttt{r"\textbackslash n\textbackslash(?([ABCD])\textbackslash)?\textbackslash. "} \\
13 & \texttt{r"\textbackslash n\textbackslash(?([ABCD])\textbackslash)? "} \\
14 & \texttt{r"\textbackslash(?([ABCD])\textbackslash)?\textbackslash.? "} \\
15 & \texttt{r"\textasciicircum([ABCD])\textbackslash.?\$"} \\
\bottomrule
\end{tabular*}
\caption{Regex patterns used to extract the answer from the models' responses in the MCQA task.}
\label{tab:regex}
\end{table}
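
As an illustration of how the patterns in \Cref{tab:regex} could be applied, the following Python sketch tries them in the listed order and returns the first captured option. The function name \texttt{extract\_choice}, the use of \texttt{re.search}, and the first-match-wins ordering are assumptions made for this example; they do not describe the exact evaluation script.

\begin{minted}[
    bgcolor=eggshell,
    frame=single,
    framesep=2mm,
    fontsize=\small,
    breaklines=true
]{python}
import re

# Patterns copied from the table, in the same order.
PATTERNS = [
    r"Risposta \(?([ABCD])\)?",
    r"risposta corretta è la \(?([ABCD])\)?",
    r"risposta corretta è: \(?([ABCD])\)?",
    r"risposta corretta è \(?([ABCD])\)?",
    r"risposta \(?([ABCD])\)?",
    r"Risposta: \(?([ABCD])\)?",
    r"risposta: \(?([ABCD])\)?",
    r"Risposta è \(?([ABCD])\)?",
    r"risposta è \(?([ABCD])\)?",
    r"Risposta è la \(?([ABCD])\)?",
    r"risposta è la \(?([ABCD])\)?",
    r"\n\(?([ABCD])\)?\. ",
    r"\n\(?([ABCD])\)? ",
    r"\(?([ABCD])\)?\.? ",
    r"^([ABCD])\.?$",
]

def extract_choice(response):
    """Return the first option letter found, or None if nothing matches."""
    # Assumption: patterns are tried in order and the first match wins.
    for pattern in PATTERNS:
        match = re.search(pattern, response)
        if match:
            return match.group(1)
    return None
\end{minted}

Under this ordering, the more specific patterns are tried before the permissive ones (14 and 15), which act as fallbacks for responses that contain only the option letter.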



\end{document}

%%
%% End of file