Topic-Activated Document Exploration: A Context-Aware LLM-Powered Hierarchical Topic Generation and Labeling Framework

ACL ARR 2025 February Submission5331 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: We propose Topic-Activated Document Exploration (TADE), a hierarchical topic generation and labeling framework that uses large language models (LLMs) to dynamically extract document contents based on their semantic relevance to specific topics. TADE departs substantially from embedding-based clustering methods for topic modeling, which tend to produce results that over-index on linguistic similarity rather than semantic similarity. When applied to large corpora, BERTopic-based approaches often generate spurious topics with compound or overly narrow labels. This makes objective, automated assessment of the themes within a corpus error-prone and requires substantial human intervention to "massage" the results. By contrast, TADE's LLM-based generation and refinement of topics eliminates such noisy topics through a semantic algorithm, yielding topic sets whose topics are more distinct from one another. Furthermore, TADE enables hierarchical exploration of themes through context-aware subtopic generation and assignment, providing a top-down approach that contrasts with the bottom-up methodology typically employed in BERT-based methods. Experimental results show that TADE outperforms traditional BERT and LDA topic models in interpretability, achieving superior topic coherence, comparable topic diversity, and better distribution balance.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: topic modeling, explanation faithfulness, free-text/natural language explanations, hierarchical & concept explanations
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 5331
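
Note on the evaluation metrics: the abstract reports topic diversity and distribution balance but does not define them. The sketch below shows one common way such quantities could be computed, under two assumptions that are not taken from the paper: topic diversity is the fraction of unique words among the top-k words of all topics, and distribution balance is the normalized Shannon entropy of the number of documents assigned to each topic. The function names and the toy inputs are hypothetical, and the authors' actual definitions may differ.

```python
import math
from collections import Counter


def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    ASSUMPTION: a commonly used definition, not necessarily the paper's.
    `topics` is a list of word lists, each ordered by importance.
    """
    words = [w for topic in topics for w in topic[:top_k]]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def distribution_balance(assignments):
    """Normalized Shannon entropy of topic sizes, in [0, 1].

    ASSUMPTION: one plausible reading of "distribution balance".
    `assignments` holds one topic label per document. A value of 1.0
    means documents are spread evenly across the observed topics.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if len(counts) < 2:
        # With zero or one topic there is no spread to measure.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Toy inputs, invented for illustration only.
toy_topics = [["tax", "budget"], ["budget", "election"]]
toy_assignments = ["economy", "politics", "economy", "politics"]

diversity = topic_diversity(toy_topics, top_k=2)
balance = distribution_balance(toy_assignments)
```

Under these assumed definitions, a higher diversity score means less word overlap between topics, and a higher balance score means that no single topic absorbs most of the corpus.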