Keywords: Topic Modeling, TnT-LLM, Textonomy, Content Analysis, Large Language Models, Scalable Machine Learning, Open-Source LLMs
Abstract: Automating text content analysis via topic modeling with Large Language Models (LLMs) faces a trilemma: a trade-off between interpretability, scalability, and the accessibility of open-source models. This paper argues for a task-oriented view of topic modeling and introduces Textonomy, an implementation of the two-stage TnT-LLM framework, as a practical solution. Textonomy first uses an LLM to iteratively generate a data-driven taxonomy from a small sample of document summaries. It then trains a lightweight classifier on LLM-generated pseudo-labels for efficient, large-scale inference. We conduct a rigorous evaluation against traditional (LDA), neural (BERTopic), and pure-LLM (TopicGPT) topic models on two distinct datasets: WikiText-103 and a corpus of US Congressional bills. To address reproducibility, we benchmark Textonomy with both proprietary (OpenAI) and open-source (Mistral) LLMs. Results show that Textonomy achieves competitive or superior alignment with human-annotated ground-truth clusters while reducing computational costs by over 99% compared to TopicGPT. Our work demonstrates that classification-based frameworks can effectively solve common topic modeling tasks, offering a scalable path to highly interpretable, goal-driven content analysis.
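The second stage described in the abstract — training a lightweight classifier on LLM-generated pseudo-labels so that large-scale inference needs no further LLM calls — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the documents, the pseudo-labels, and the choice of a nearest-centroid bag-of-words classifier are all hypothetical stand-ins for whatever taxonomy, sample, and classifier Textonomy actually uses.

```python
# Hypothetical sketch of a "lightweight classifier on pseudo-labels"
# stage. Stage 1 (not shown) would have an LLM derive a taxonomy and
# pseudo-label a small sample; stage 2 fits a cheap classifier so that
# corpus-scale inference is a fast local operation, not an LLM call.
from collections import Counter
import math

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

def train_centroids(docs, labels):
    """Average bag-of-words vector per pseudo-label (nearest-centroid)."""
    centroids, counts = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        centroids.setdefault(label, Counter()).update(tokenize(doc))
    return {lbl: Counter({w: n / counts[lbl] for w, n in c.items()})
            for lbl, c in centroids.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc, centroids):
    vec = Counter(tokenize(doc))
    return max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))

# Hypothetical small sample with LLM-assigned pseudo-labels:
docs = [
    "The bill amends the tax code for small businesses.",
    "New funding is allocated for highway construction.",
    "The act revises corporate income tax brackets.",
    "Grants support repairs to bridges and public transit.",
]
pseudo_labels = ["taxation", "infrastructure", "taxation", "infrastructure"]

centroids = train_centroids(docs, pseudo_labels)
print(classify("A proposal to adjust payroll tax rates.", centroids))
# prints "taxation"
```

Once the centroids are fit, classifying a new document costs only a few vocabulary-sized dot products, which is the source of the large cost reduction relative to querying an LLM per document.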
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: topic modeling, NLP in resource-constrained setting, human-subject application-grounded evaluations
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Theory
Languages Studied: English
Submission Number: 3831