Abstract: In this paper, we propose a graph-based representation of document collections in which both documents and features are represented by nodes. The nodes are connected by edges whose weights are based on word order, context similarity, and word frequency. Graph-based representations can overcome a limitation of bag-of-words representations, which suffer from sparseness in collections of short documents. In a series of experiments, we evaluate multiple types of graph-based text features in the context of semi-supervised text classification, and investigate the effect of the number of labeled documents in the collection. We find that, in 20-class text categorization, graph-based semi-supervised learning outperforms bag-of-words semi-supervised learning but not bag-of-words supervised learning. A major advantage of graph-based representations is that they are flexible in the types of nodes and relations that can be included.
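To make the representation concrete, the following is a minimal sketch of a graph in which documents and words are both nodes. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_graph`, the node tuples, and the weighting schemes (relative term frequency for document-word edges, adjacency counts for word-word edges) are hypothetical choices, and the context-similarity edges mentioned in the abstract are omitted.

```python
from collections import defaultdict


def build_graph(docs):
    """Build a weighted graph with document nodes and word nodes.

    docs: dict mapping a document id to its list of tokens.
    Returns a dict mapping (source_node, target_node) to an edge weight.
    All weighting choices here are illustrative assumptions.
    """
    edges = defaultdict(float)
    for doc_id, tokens in docs.items():
        doc_node = ("doc", doc_id)
        # Document-word edges, weighted by relative term frequency.
        for tok in tokens:
            edges[(doc_node, ("word", tok))] += 1.0 / len(tokens)
        # Word-word edges between adjacent tokens, a simple word-order signal.
        for left, right in zip(tokens, tokens[1:]):
            edges[(("word", left), ("word", right))] += 1.0
    return dict(edges)


docs = {
    "d1": ["graph", "based", "text", "classification"],
    "d2": ["text", "classification", "with", "graph", "features"],
}
graph = build_graph(docs)
```

Because the two short documents share word nodes such as `text` and `classification`, they are connected through the graph even when their bag-of-words vectors would overlap in only a few dimensions. New node or edge types can be added by writing further entries into the same edge dictionary.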