Beyond Moment: Rethinking Evaluation Paradigm for Timeline Summarization in the era of LLMs

ICLR 2026 Conference Submission16024 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Timeline Summarization (TLS), Evaluation Metrics, Semantic Alignment, Large Language Models (LLMs), Bipartite Matching
TL;DR: This work proposes semantic alignment–based evaluation metrics and a full-stage LLM-TLS framework to advance timeline summarization.
Abstract: Timeline summarization (TLS) aims to condense large collections of temporally ordered documents into concise and coherent narratives of key events. Although large language models (LLMs) have driven recent advances, progress in TLS cannot be assessed objectively because reliable evaluation metrics are lacking. Existing evaluation metrics, such as Date F1 and A-ROUGE, assume that milestones aligned at the same timestamp convey identical semantic meaning. This design choice emphasizes temporal consistency and is inherently biased against abstractive or semantically equivalent outputs. Consequently, such evaluation protocols fail to reflect the genuine improvements brought by LLMs and deviate from human judgments when comparing the relative merits of different methods. To assess more faithfully whether the predicted timeline and the reference timeline refer to the same events, we propose a new evaluation framework in which all metrics are grounded in semantically aligned sentence pairs rather than merely time-aligned milestones. We use an LLM to compute semantic similarity, align sentence pairs via maximum-weight bipartite matching, and compute a Pair-Match score. Building on this alignment, we further compute Date F1 and ROUGE over the aligned pairs to jointly evaluate semantic coverage and temporal fidelity; we term these metrics Pair-Date F1 and Pair-ROUGE, respectively. To validate the proposed metrics, we introduce a full-stage LLM-TLS (FS-LLM-TLS) approach and compare it against prior methods. Experiments show that FS-LLM-TLS surpasses prior methods on existing evaluation metrics, and that its advantages are reflected more faithfully under our evaluation framework, which offers a fairer and more reliable assessment of method quality. This evaluation framework establishes a new paradigm for TLS evaluation and lays a foundation for future experimentation and system development.
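The alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact Pair-Match formula, so the similarity threshold, the precision/recall/F1 form of the score, and the names `max_weight_matching` and `pair_match_f1` are all assumptions made for illustration. The similarity matrix is hand-written and stands in for LLM-computed scores, and the matching is found by brute force, which is only practical for toy inputs.

```python
from itertools import permutations


def max_weight_matching(sim):
    """Return (pairs, total) for a maximum-weight one-to-one matching.

    sim[i][j] is the similarity between predicted sentence i and
    reference sentence j (hypothetical values standing in for LLM scores).
    Brute force over assignments: suitable for toy inputs only.
    """
    n_pred = len(sim)
    n_ref = len(sim[0]) if sim else 0
    if n_pred == 0 or n_ref == 0:
        return [], 0.0

    # Assign every item on the smaller side to a distinct item on the larger side.
    transposed = n_pred > n_ref
    small, large = (n_ref, n_pred) if transposed else (n_pred, n_ref)

    def weight(s, l):
        return sim[l][s] if transposed else sim[s][l]

    best_total, best_assign = float("-inf"), None
    for assign in permutations(range(large), small):
        total = sum(weight(s, l) for s, l in enumerate(assign))
        if total > best_total:
            best_total, best_assign = total, assign

    # Always report pairs as (predicted index, reference index).
    pairs = [(l, s) if transposed else (s, l) for s, l in enumerate(best_assign)]
    return sorted(pairs), best_total


def pair_match_f1(sim, threshold=0.5):
    """Assumed Pair-Match-style score: F1 over matched pairs above a threshold."""
    pairs, _ = max_weight_matching(sim)
    kept = [(i, j) for i, j in pairs if sim[i][j] >= threshold]
    if not kept:
        return 0.0, kept
    precision = len(kept) / len(sim)     # matched share of predicted sentences
    recall = len(kept) / len(sim[0])     # matched share of reference sentences
    return 2 * precision * recall / (precision + recall), kept


# Two predicted sentences scored against three reference sentences.
example_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
]
score, aligned = pair_match_f1(example_sim)
print(aligned, score)
```

Under the same assumptions, date-based and ROUGE-based variants would be computed over the pairs in `aligned` rather than over sentences that merely share a timestamp. For realistically sized timelines, a polynomial-time assignment solver would replace the brute-force search.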
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16024