LongNovel: A Multi-Scale Benchmark for Hallucination Detection in Long-Context Novel Summarization

ACL ARR 2026 January Submission7232 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Hallucination Detection, Long-Context Summarization, Large Language Models
Abstract: Although context windows have expanded significantly in recent years, hallucination in long-context summarization remains a challenge. Long novels are better suited than news articles or academic papers for studying these hallucinations, owing to their rich internal information and detailed descriptions of events and dialogue. However, current research lacks a Chinese benchmark for hallucination detection in long novels and has not fully explored how hallucinations change as the context grows longer. In this study, we propose LongNovel, the first long-context Chinese novel benchmark for hallucination detection. The benchmark is constructed from 29 books, with contexts ranging from 2k to 100k tokens. We define 8 hallucination types and employ a combination of Multi-Model Arbitration and Entity-Referenced Hallucination Generation to ensure both data authenticity and a balanced distribution of hallucination categories. Furthermore, we manually revise the contents of the test set to guarantee data reliability. Extensive experimental results demonstrate that LongNovel is a challenging benchmark. We release LongNovel for future research. https://anonymous.4open.science/r/LongNovel-60B194
Paper Type: Long
Research Area: Summarization
Research Area Keywords: long-form summarization, factuality, evaluation, hallucination detection
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 7232