Say as It Is: Verbatim Fidelity Evaluation of Long-Context Language Models

Published: 10 Jun 2025 · Last Modified: 10 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: verbatim, LLM, long-context language model
TL;DR: With our new evaluation framework, we find that long-context language models recall lengthy inputs reliably but struggle when reasoning and understanding are involved.
Abstract: Accurately processing long texts and generating precise responses remains a significant challenge for large language models (LLMs). While existing benchmarks evaluate long-text comprehension, they often overlook a model’s ability to faithfully preserve the exact wording, formatting, and sequence of the prompt in its responses. To address this gap, we propose a novel evaluation framework with two key advantages: (i) adaptability across diverse domains and data sources, and (ii) tunable difficulty through dynamic variation of text length. Across three tasks—mathematical, contextual, and semantic reasoning—we find that even state-of-the-art long-context LLMs exhibit notable difficulty in maintaining verbatim fidelity during long-text generation.
Submission Number: 5