Multi-labeled document classification using semi-supervised mixture model of Watson distributions on document manifold

Nguyen Kim Anh, Ngo Van Linh, Nguyen Khac Toi, Nguyen The Tarn

Published: 2013, Last Modified: 13 Dec 2024, SoCPaR 2013, CC BY-SA 4.0

Abstract: Classification of multilabel documents is essential to information retrieval and text mining. Most existing approaches to multilabel text classification pay no attention to the relationship between class labels and input documents, and they rely entirely on labeled data for classification. In practice, unlabeled data is readily available, whereas generating labeled data is expensive and error-prone because it requires human annotation. In this paper, we propose a novel multilabel document classification approach based on a semi-supervised mixture model of Watson distributions on the document manifold, which explicitly considers the manifold structure of the document space to efficiently exploit both labeled and unlabeled data for classification. Our proposed approach models all labels within a dataset simultaneously, which lends itself well to capturing the relationships between these labels. The experimental results show that the proposed method outperforms state-of-the-art methods for multilabel text classification.