Keywords: Benchmark, Multi-turn dialogues, Long Form Question Answering
TL;DR: A benchmark to evaluate knowledge-intensive long-form question answering for LLMs in multi-turn dialogues
Abstract: Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other, orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce **KnowMT-Bench**, the *first* benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess models' real-world performance, KnowMT-Bench employs a dynamic evaluation setting in which models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the *final-turn* answer are then evaluated via a human-validated automated pipeline. Our experiments on a diverse suite of LLMs show a clear degradation in both factual capability and information delivery efficiency in multi-turn contexts. We further probe the underlying causes and find that contextual noise, particularly relevant misinformation, along with increasing context length and dialogue structure, substantially contributes to this degradation. In addition, experimental results on mitigation strategies demonstrate that structural context refinement and RAG can effectively alleviate these issues, with RAG notably capable of reversing the performance degradation. These findings underscore the importance of our benchmark for evaluating and enhancing LLMs' conversational factual capabilities in real-world applications. Code and data are available at [KnowMT-Bench](https://anonymous.4open.science/r/KnowMT-Bench-651D/).
Primary Area: datasets and benchmarks
Submission Number: 3911