Abstract: Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios.
We design tasks across three main categories—retrieval-dependent, reasoning-dependent, and hybrid—and evaluate both popular LLMs and specialized methods on their ability to understand long contexts across these tasks. Our results reveal that current methods struggle to process highly redundant texts effectively, showing clear preferences for specific task types but with no single method excelling across all tasks. Based on these findings, we propose a simple yet strong baseline that addresses these challenges and achieves substantial performance improvements. Our analysis offers valuable insights into the strengths and limitations of existing methods for processing spoken texts, laying the groundwork for advancing long-text understanding in real-world applications. As the first benchmark specifically designed for spoken long-text understanding, it not only tackles key challenges in this domain but also serves as a valuable resource for driving innovation in e-commerce applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Long-context understanding, spoken language, KV cache compression
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 5965