- Abstract: We introduce similarity encoders (SimEc), which learn similarity preserving representations by using a feed-forward neural network to map data into an embedding space where the original similarities can be approximated linearly. The model can easily compute representations for novel (out-of-sample) data points, even if the original pairwise similarities of the training set were generated by an unknown process such as human ratings. This is demonstrated by creating embeddings of both image and text data. Furthermore, the idea behind similarity encoders gives an intuitive explanation of the optimization strategy used by the continuous bag-of-words (CBOW) word2vec model trained with negative sampling. Based on this insight, we define context encoders (ConEc), which can improve the word embeddings created with word2vec by using the local context of words to create out-of-vocabulary embeddings and representations for words with multiple meanings. The benefit of this is illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition task.
- TL;DR: Neural network way of doing kernel PCA and an extension of word2vec to compute out-of-vocabulary embeddings and distinguish between multiple meanings of a word based on its local context.
- Conflicts: tu-berlin.de
- Keywords: Natural language processing, Unsupervised Learning, Supervised Learning