Abstract: Large Language Models (LLMs) have demonstrated near-human performance on summarization tasks as measured by traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as \textit{LLM-as-a-Judge}, address the limitations of lexical-similarity-based metrics but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore (NFS), the first "Agent-as-a-Judge" framework that evaluates and refines factuality in narrative summarization. By leveraging a Character Knowledge Graph (CKG) extracted from the input narrative, NarrativeFactScore evaluates factuality and provides actionable guidance for refinement, such as identifying missing or erroneous facts. Our experimental results demonstrate that constructing the CKG enables reasoning with one-third of the factuality computation used in prior approaches and achieves a three times higher correlation with human judgments. Furthermore, refinement with this actionable guidance improves summary quality.\footnote{\href{https://anonymous.4open.science/r/NFS-1240}{https://anonymous.4open.science/r/NFS-1240}}
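The CKG-based factuality scoring described in the abstract can be illustrated with a minimal sketch, not the authors' implementation: summary facts are treated as (subject, relation, object) triples and checked against triples extracted from the narrative, with unsupported triples returned as refinement guidance. The `Triple` type, the exact-match check, and the example characters are assumptions for illustration; the paper's agent performs LLM-based extraction and verification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) fact, e.g. about a character."""
    subject: str
    relation: str
    obj: str


def narrative_fact_score(summary_triples: list[Triple],
                         ckg_triples: set[Triple]) -> tuple[float, list[Triple]]:
    """Return the fraction of summary facts supported by the CKG, plus the
    unsupported facts, which serve as actionable guidance for refinement."""
    if not summary_triples:
        return 1.0, []
    unsupported = [t for t in summary_triples if t not in ckg_triples]
    score = 1.0 - len(unsupported) / len(summary_triples)
    return score, unsupported


# Hypothetical usage; character names and facts are illustrative only.
ckg = {Triple("Pip", "guardian", "Joe"), Triple("Pip", "loves", "Estella")}
summary = [Triple("Pip", "loves", "Estella"), Triple("Pip", "guardian", "Magwitch")]
score, to_fix = narrative_fact_score(summary, ckg)
print(score, to_fix)  # 0.5, and the erroneous guardian fact flagged for revision
```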
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: knowledge graphs, knowledge base construction, long-form summarisation, retrieval, fact checking, explanation faithfulness, evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7574