Abstract: This study introduces a novel preprocessing approach that applies dependency parsing to extract noun and verb heads, which are then used to generate unigram and n-gram representations. We investigate the trade-off between topic coherence and diversity in topic modeling, demonstrating how increased diversity enhances text pattern discovery. Using three preprocessing methods to train LDA models \cite{bleiLatentDirichletAllocation2003}, we find that while coherence decreases slightly, topic diversity increases significantly, leading to the identification of novel patterns. By prioritizing topics with multi-word complements, our approach improves the granularity of the results and highlights the role of diversity in uncovering deeper textual structures. To further validate these findings, we recommend evaluating with additional diversity metrics.
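The pipeline sketched in the abstract (dependency parsing to keep only noun and verb heads, then topic modeling over the reduced representation) can be illustrated with a minimal sketch. The head-selection heuristic, the spaCy model name, and the toy documents below are illustrative assumptions, not the authors' exact implementation.

```python
import spacy
from gensim import corpora
from gensim.models import LdaModel

# Assumption: a small English spaCy pipeline provides the dependency parse.
nlp = spacy.load("en_core_web_sm")

def extract_heads(text):
    """Keep lemmas of NOUN/VERB tokens that head at least one dependent
    in the dependency parse (an assumed proxy for 'noun and verb heads')."""
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in ("NOUN", "VERB") and any(True for _ in token.children)
    ]

# Toy documents for illustration only.
documents = [
    "The model extracts syntactic heads from parsed sentences.",
    "Topic diversity reveals patterns that coherence alone can miss.",
]

tokenized = [extract_heads(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train an LDA model on the head-only unigram representation.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

Multi-word (n-gram) variants of this representation could be obtained by joining a head with its dependents before building the dictionary; the unigram case is shown here for brevity.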
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: topic modeling, knowledge tracing/discovering/inducing, multi-word expressions, part-of-speech tagging, dependency parsing
Languages Studied: English
Submission Number: 1366