TopicSummRAG: A Topic-Enhanced RAG model for Improving Query-Focused Summarization from Long Documents
Abstract: Despite larger context windows, large language models (LLMs) continue to struggle with answering queries over long, unstructured documents. Retrieval-Augmented Generation (RAG) mitigates this limitation by retrieving relevant text before generation; however, its effectiveness depends critically on how documents are segmented into retrievable units. Existing chunking strategies, such as fixed-size or sliding windows, often ignore topical coherence: they merge unrelated content or fragment coherent discourse, which degrades retrieval precision and downstream generation quality. We propose TopicSummRAG, a framework that aligns retrieval units with the latent topical structure of documents. TopicSummRAG first segments documents into contextually coherent topical chunks using a boundary-supervision-free contrastive text segmentation model, and then summarizes each segment to form compact retrieval metadata. The segmentation component is evaluated independently on the QMSum and TIAGE benchmarks, where it consistently improves boundary detection and placement over strong baselines. During retrieval, segment-level summaries are matched against the query, and an entropy- and dominance-based filtering strategy adaptively selects relevant segments by measuring the concentration of relevance mass, avoiding brittle fixed cutoffs. We evaluate TopicSummRAG on ODSum-Story, ODSum-Meeting, and QMSum across multiple retrievers (BM25, SBERT, Situated) and LLMs (Qwen3-8B, LLaMA-3.1-8B, Gemma-3-12B). TopicSummRAG yields improvements of up to 13% ROUGE-1, 25% ROUGE-2, and 20% ROUGE-L, alongside 10–15 point gains in LENS and up to 10-point improvements in BLANC. These results demonstrate that topic-aware segmentation and adaptive retrieval substantially improve factual grounding, coherence, and retrieval robustness, providing a retriever- and model-agnostic framework for long-document query-focused summarization and question answering with RAG.
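The abstract does not give the formulas behind the entropy- and dominance-based filter, so the following is only an illustrative sketch of how such a filter might work, not the authors' method. It assumes query-to-summary relevance scores are turned into a distribution with a softmax, that "dominance" means the probability of the top segment, and that normalized entropy sets how much relevance mass to cover. The function name `select_segments` and the parameters `dominance_threshold`, `min_mass`, `max_mass`, and `temperature` are hypothetical.

```python
import math


def select_segments(scores, dominance_threshold=0.5,
                    min_mass=0.5, max_mass=0.9, temperature=1.0):
    """Return indices of segments to keep, given query-summary relevance scores.

    Illustrative assumptions (not taken from the paper):
      * relevance mass is a softmax over the raw scores;
      * dominance is the probability of the single best segment;
      * the mass to cover grows linearly with normalized entropy.
    """
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [0]

    # Numerically stable softmax over the relevance scores.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank segments from most to least relevant.
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)

    # Dominance check: one segment holds most of the relevance mass.
    if probs[order[0]] >= dominance_threshold:
        return [order[0]]

    # Normalized entropy in [0, 1]: 0 = fully concentrated, 1 = uniform.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0) / math.log(n)

    # A flatter distribution asks for more mass, hence more segments.
    target_mass = min_mass + (max_mass - min_mass) * entropy

    selected, covered = [], 0.0
    for i in order:
        selected.append(i)
        covered += probs[i]
        if covered >= target_mass:
            break
    return sorted(selected)
```

Under these assumptions, a sharply peaked score vector such as `[10.0, 0.0, 0.0, 0.0]` keeps only the top segment, while a flat one such as `[1.0, 1.0, 1.0, 1.0]` keeps all four, which is the adaptive behavior a fixed top-k cutoff cannot provide.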
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=9AQkpvLRbw&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: We thank the reviewers and editors for their time. We apologize for the formatting issues in the previous submission. In this revision, we have carefully addressed all formatting concerns and ensured full compliance with the TMLR submission guidelines. No changes were made to the technical content; the revisions are limited to formatting corrections and presentation improvements.
Assigned Action Editor: ~Li_Dong1
Submission Number: 7066