Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations
Philip Blair, Yuval Merhav, Joel Barry
Feb 16, 2017 (modified: Feb 16, 2017)
ICLR 2017 workshop submission
readers: everyone
Abstract: We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of "outlier" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.
TL;DR: Applying simple heuristics to the Wikidata entity graph results in a high-quality semantic similarity dataset.
Keywords: Natural language processing, Applications
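
To make the outlier detection task mentioned in the abstract concrete, the following is a minimal sketch of one common way such a test group can be scored: each item is rated by its mean cosine similarity to the other items, and the lowest-rated item is predicted to be the outlier. This is an illustrative assumption, not necessarily the scoring procedure the authors used; the function names and the toy three-dimensional vectors are hypothetical and are not taken from WikiSem500.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_outlier(vectors):
    """Return the item whose mean cosine similarity to all other items is lowest.

    `vectors` maps each item of one test group (cluster members plus one
    candidate outlier) to its embedding.
    """
    scores = {}
    for word, vec in vectors.items():
        others = [v for w, v in vectors.items() if w != word]
        scores[word] = sum(cosine(vec, o) for o in others) / len(others)
    return min(scores, key=scores.get)


# Hypothetical toy embeddings: three similar items and one unrelated item.
toy_vectors = {
    "paris": [0.9, 0.1, 0.0],
    "berlin": [0.8, 0.2, 0.1],
    "madrid": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

predicted = predict_outlier(toy_vectors)
print(predicted)
```

Under this sketch, an embedding would be judged on how often the predicted item matches the labeled outlier across many such groups.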