Short-Text Clustering using Statistical Semantics

Sepideh Seifzadeh, Ahmed K. Farahat, Mohamed S. Kamel, Fakhri Karray

2015 (modified: 12 Nov 2022) · WWW (Companion Volume) 2015

Abstract: Short documents are typically represented by very sparse vectors in the space of terms. In this case, traditional techniques for calculating text similarity result in measures that are very close to zero, since documents, even very similar ones, have very few or no terms in common. To alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about the correlation between terms. In other words, if two short segments do not have any common words, but terms from the first segment frequently co-occur with terms from the second segment in other documents, then these segments are semantically related, and their similarity measure should be high. Toward this goal, we employ a method for enhancing document clustering using statistical semantics. However, calculating the correlation between all terms incurs a high computation time. In this work, we propose selecting a few terms and using them with the Nyström method to approximate the term-term correlation matrix. The terms for the Nyström method are selected by random sampling with probabilities proportional to the lengths of their vectors in the document space. This allows more important terms to have more influence on the approximation of the term-term correlation matrix and accordingly achieves better accuracy.
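
The sampling and approximation steps described in the abstract might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the term-term correlation matrix is the Gram matrix G = X Xᵀ of a terms-by-documents matrix X, that terms are sampled without replacement, and that the pseudo-inverse is computed by a thresholded eigendecomposition. The function name `nystrom_term_correlation` and its parameters are hypothetical.

```python
import numpy as np

def nystrom_term_correlation(X, m, seed=0, tol=1e-10):
    """Nystrom factors for the term-term matrix G = X @ X.T (assumed form).

    X is a dense (n_terms, n_docs) array. Returns (C, W_pinv, idx) such that
    G is approximated by C @ W_pinv @ C.T without ever forming G itself.
    """
    rng = np.random.default_rng(seed)

    # Sampling probabilities proportional to the length (Euclidean norm)
    # of each term's vector in the document space.
    norms = np.linalg.norm(X, axis=1)
    probs = norms / norms.sum()

    # Assumption: sample m distinct terms (without replacement).
    idx = rng.choice(X.shape[0], size=m, replace=False, p=probs)

    # C holds correlations between all terms and the sampled terms;
    # W is the block of correlations among the sampled terms only.
    C = X @ X[idx].T          # shape (n_terms, m)
    W = C[idx]                # shape (m, m), symmetric

    # Pseudo-inverse of W, dropping eigenvalues that are numerically zero.
    evals, evecs = np.linalg.eigh(W)
    keep = evals > tol * evals.max()
    W_pinv = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T

    return C, W_pinv, idx
```

Only the n-by-m block C and the m-by-m block W are computed, so the cost grows with the number of sampled terms m and not with the square of the vocabulary size. When the sampled terms span the same space as all term vectors, the approximation C W⁺ Cᵀ recovers G exactly; otherwise it is a low-rank estimate whose quality depends on which terms were drawn.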
