Learning Multi-facet Embeddings of Phrases and Sentences using Sparse Coding for Unsupervised Semantic Applications

Haw-Shiuan Chang; Amol Agrawal; Andrew McCallum

Learning Multi-facet Embeddings of Phrases and Sentences using Sparse Coding for Unsupervised Semantic Applications

Haw-Shiuan Chang, Amol Agrawal, Andrew McCallum

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone

TL;DR: We propose an unsupervised way to learn multiple embeddings for sentences and phrases

Abstract: Most deep learning for NLP represents each word with a single point or single-mode region in semantic space, while the existing multi-mode word embeddings cannot represent longer word sequences like phrases or sentences. We introduce a phrase representation (also applicable to sentences) where each phrase has a distinct set of multi-mode codebook embeddings to capture different semantic facets of the phrase's meaning. The codebook embeddings can be viewed as the cluster centers which summarize the distribution of possibly co-occurring words in a pre-trained word embedding space. We propose an end-to-end trainable neural model that directly predicts the set of cluster centers from the input text sequence (e.g., a phrase or a sentence) during test time. We find that the per-phrase/sentence codebook embeddings not only provide a more interpretable semantic representation but also outperform strong baselines (by a large margin in some tasks) on benchmark datasets for unsupervised phrase similarity, sentence similarity, hypernym detection, and extractive summarization.

Keywords: lexical semantics, sparse coding, multi-mode embeddings, unsupervised sentence embedding, clustering, set decoder, semantic similarity

Original Pdf: pdf

5 Replies

Loading