Abstract: This study introduces a novel preprocessing approach that applies dependency parsing to extract noun and verb heads, which are then used to generate unigram and n-gram representations. We investigate the trade-off between topic coherence and diversity in topic modeling, demonstrating how increased diversity enhances text pattern discovery. Using three preprocessing methods to train LDA models \cite{bleiLatentDirichletAllocation2003}, we find that while coherence decreases slightly, topic diversity increases significantly, leading to the identification of novel patterns. By prioritizing topics with multi-word complements, our approach improves the granularity of the results and highlights the role of diversity in uncovering deeper textual structures. To further validate these findings, we recommend evaluating with additional diversity metrics.
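The pipeline sketched in the abstract (dependency parsing to keep only noun and verb heads, then topic modeling over the reduced representation) can be illustrated with a minimal sketch. The head-selection heuristic, the spaCy model name, and the toy documents below are illustrative assumptions, not the authors' exact implementation.

```python
import spacy
from gensim import corpora
from gensim.models import LdaModel

# Assumption: a small English spaCy pipeline provides the dependency parse.
nlp = spacy.load("en_core_web_sm")

def extract_heads(text):
    """Keep lemmas of NOUN/VERB tokens that head at least one dependent
    in the dependency parse (an assumed proxy for 'noun and verb heads')."""
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in ("NOUN", "VERB") and any(True for _ in token.children)
    ]

# Toy documents for illustration only.
documents = [
    "The model extracts syntactic heads from parsed sentences.",
    "Topic diversity reveals patterns that coherence alone can miss.",
]

tokenized = [extract_heads(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train an LDA model on the head-only unigram representation.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

Multi-word (n-gram) variants of this representation could be obtained by joining a head with its dependents before building the dictionary; the unigram case is shown here for brevity.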
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: topic modeling, knowledge tracing/discovering/inducing, multi-word expressions, part-of-speech tagging, dependency parsing
Languages Studied: English
Submission Number: 1366