Short-text representation using diffusion waveletsOpen Website

Vidit Jain, Jay Mahadeokar

2014 (modified: 12 Nov 2022)WWW (Companion Volume) 2014Readers: Everyone
Abstract: Usual text document representations such as tf-idf do not work well in classification tasks for short-text documents and across diverse data domains. Optimizing different representations for different data domains is infeasible in a practical setting on the Internet. Mining such representations from the data in an unsupervised manner is desirable. In this paper, we study a representation based on the multi-scale harmonic analysis of term-term co-occurrence graph. This representation is not only sparse, but also leads to the discovery of semantically coherent topics in data. In our experiments on user-generated short documents e.g., newsgroup messages, user comments, and meta-data, we found this representation to outperform other representations across different choice of classifiers. Similar improvements were also observed for data sets in Chinese and Portuguese languages.
0 Replies

Loading