Unsupervised Sparse Vector Densification for Short Text Similarity

Yangqiu Song, Dan Roth

2015 (modified: 16 Jul 2019)HLT-NAACL 2015Readers: Everyone

Abstract: Sparse representations of text such as bag-ofwords models or extended explicit semantic analysis (ESA) representations are commonly used in many NLP applications. However, for short texts, the similarity between two such sparse vectors is not accurate due to the small term overlap. While there have been multiple proposals for dense representations of words, measuring similarity between short texts (sentences, snippets, paragraphs) requires combining these token level similarities. In this paper, we propose to combine ESA representations and word2vec representations as a way to generate denser representations and, consequently, a better similarity measure between short texts. We study three densification mechanisms that involve aligning sparse representation via many-to-many, many-to-one, and oneto-one mappings. We then show the effectiveness of these mechanisms on measuring similarity between short texts.

0 Replies