Dataless Text Classification: A Topic Modeling Approach with Document ManifoldOpen Website

2018 (modified: 16 Nov 2021)CIKM 2018Readers: Everyone
Abstract: Recently, dataless text classification has attracted increasing attention. It trains a classifier using seed words of categories, rather than labeled documents that are expensive to obtain. However, a small set of seed words may provide very limited and noisy supervision information, because many documents contain no seed words or only irrelevant seed words. In this paper, we address these issues using document manifold, assuming that neighboring documents tend to be assigned to a same category label. Following this idea, we propose a novel Laplacian seed word topic model (LapSWTM). In LapSWTM, we model each document as a mixture of hidden category topics, each of which corresponds to a distinctive category. Also, we assume that neighboring documents tend to have similar category topic distributions. This is achieved by incorporating a manifold regularizer into the log-likelihood function of the model, and then maximizing this regularized objective. Experimental results show that our LapSWTM significantly outperforms the existing dataless text classification algorithms and is even competitive with supervised algorithms to some extent. More importantly, it performs extremely well when the seed words are scarce.
0 Replies

Loading