Abstract: Topic models uncover thematic structures in large document collections by assigning documents to topics and representing each topic as a ranked list of terms. However, these lists are often hard to interpret and insufficient for knowledge-intensive exploration, especially in scientific domains. We propose the task of $\textbf{Topic Description for Scientific Corpora}$, which focuses on generating structured, concise, and informative summaries for topic-specific document sets. To this end, we adapt two LLM-based pipelines: $\textit{Selective Context Summarisation}$ (SCS), which uses maximum marginal relevance to select representative documents; and $\textit{Compressed Context Summarisation}$ (CCS), a hierarchical approach based on the RAPTOR framework that recursively abstracts subsets of documents to compress the input. We evaluate both methods using SUPERT and a multi-model LLM-as-a-Judge across three topic modeling backbones (CTM, BERTopic, TopicGPT) and three scientific corpora. SCS consistently outperforms CCS in quality and robustness, while CCS performs better on larger topics despite a higher risk of information loss. Our findings highlight trade-offs between selective and compressed strategies and provide new benchmarks for topic-level summarisation. Code and data for two of the three datasets will be released.
Paper Type: Long
Research Area: Summarization
Research Area Keywords: multi-document summarization, topic modeling, scientific corpora, large language models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5312
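The selective step the abstract attributes to SCS, maximum marginal relevance (MMR), greedily picks documents that are relevant to the topic while penalising redundancy with documents already chosen. A minimal sketch follows; the toy vectors, the `lam` trade-off value, and the function names are illustrative assumptions, not the paper's actual embeddings or hyperparameters.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(doc_vecs, topic_vec, k, lam=0.7):
    """Greedily select k documents, balancing relevance to the
    topic vector against redundancy with already-selected docs."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(doc_vecs[i], topic_vec)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: docs 0 and 1 are near-duplicates, doc 2 is off-topic.
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
topic = [1.0, 0.2]
print(mmr_select(docs, topic, k=2))  # → [1, 0]
```

Note that MMR skips the near-duplicate of the first pick: with `lam=0.7`, document 0's redundancy penalty against document 1 still leaves it above the off-topic document 2, so the two on-topic documents are chosen in relevance order.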