Keywords: Benchmark, Narrative QA, Literary, Long-Context, Italian
TL;DR: INDAQA2 is a long-context Italian QA benchmark (461 books, up to 250K tokens). Italian-specific models lag behind multilingual ones and struggle with long contexts, especially on open-ended vs. multiple-choice tasks.
Abstract: Long-context comprehension and reasoning remain largely underexplored in the evaluation of Italian Large Language Models (LLMs). Existing Italian benchmarks primarily focus on short or medium-length inputs, offering limited insight into models' ability to process extended narratives.
To address this gap, we introduce INDAQA2, a substantially revised and expanded version of INDAQA, a benchmark for narrative question answering on original Italian literary texts.
The new version comprises an expanded corpus of 461 books, introduces a multiple-choice question answering format alongside the original open-ended tasks, and features manually curated texts drawn exclusively from works originally written in Italian, thus avoiding artifacts introduced by translation.
The benchmark evaluates long-context understanding over complete books of up to 250K tokens, testing complementary comprehension skills through a dual-structure design: global narrative understanding, assessed via questions derived from book summaries, and local precision, assessed via questions grounded in specific passages and entity-level details.
By supporting both open-ended and multiple-choice question answering formats, INDAQA2 enables evaluation of both generative capabilities and discriminative reasoning, facilitating comprehensive and scalable comparison across models.
Our evaluation of several Italian-specialized and multilingual models reveals significant performance disparities across task formats and highlights limitations in how current Italian-specialized models utilize extended contexts.
Source: zip
Ceur: pdf
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11