Abstract: Topic models uncover thematic structures in large document collections by assigning documents to topics and representing each topic as a ranked list of terms. However, these lists are often hard to interpret and insufficient for knowledge-intensive exploration, especially in scientific domains. We propose the task of $\textbf{Topic Description for Scientific Corpora}$, which focuses on generating structured, concise, and informative summaries for topic-specific document sets. To this end, we adapt two LLM-based pipelines: $\textit{Selective Context Summarisation}$ (SCS), which uses maximum marginal relevance to select representative documents; and $\textit{Compressed Context Summarisation}$ (CCS), a hierarchical approach based on the RAPTOR framework that recursively abstracts subsets of documents to compress the input. We evaluate both methods using SUPERT and a multi-model LLM-as-a-Judge across three topic modeling backbones (CTM, BERTopic, TopicGPT) and three scientific corpora. SCS consistently outperforms CCS in quality and robustness, while CCS performs better on larger topics despite a higher risk of information loss. Our findings highlight trade-offs between selective and compressed strategies and provide new benchmarks for topic-level summarisation. Code and data for two of the three datasets will be released.
Paper Type: Long
Research Area: Summarization
Research Area Keywords: multi-document summarization, topic modeling, scientific corpora, large language models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5312
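The selective step the abstract attributes to SCS, maximum marginal relevance (MMR), greedily picks documents that are relevant to the topic while penalising redundancy with documents already chosen. A minimal sketch follows; the toy vectors, the `lam` trade-off value, and the function names are illustrative assumptions, not the paper's actual embeddings or hyperparameters.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(doc_vecs, topic_vec, k, lam=0.7):
    """Greedily select k documents, balancing relevance to the
    topic vector against redundancy with already-selected docs."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(doc_vecs[i], topic_vec)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: docs 0 and 1 are near-duplicates, doc 2 is off-topic.
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
topic = [1.0, 0.2]
print(mmr_select(docs, topic, k=2))  # → [1, 0]
```

Note that MMR skips the near-duplicate of the first pick: with `lam=0.7`, document 0's redundancy penalty against document 1 still leaves it above the off-topic document 2, so the two on-topic documents are chosen in relevance order.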