Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations

03 May 2025 (modified: 23 Mar 2025), ICLR 2017 Invite to Workshop, Readers: Everyone
TL;DR: Applying simple heuristics to the Wikidata entity graph results in a high-quality semantic similarity dataset.
Abstract: We propose a language-agnostic method for automatically generating clusters of semantically similar entities, each paired with a set of "outlier" elements, which can then be used for an intrinsic evaluation of word embeddings on the outlier detection task. We used this methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings on it. The results show a correlation between performance on this dataset and performance on sentiment analysis.
Keywords: Natural language processing, Applications
Conflicts: basistech.com, neu.edu
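
To make the outlier detection task mentioned in the abstract concrete, here is a minimal sketch of one common formulation: each word in a group is scored by its average cosine similarity to the other words, and the lowest-scoring word is predicted to be the outlier. The function names, the toy three-dimensional vectors, and the scoring rule itself are assumptions made for illustration; the abstract does not state the paper's exact scoring procedure or the format of WikiSem500.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compactness(index, vectors):
    """Average similarity of vectors[index] to every other vector."""
    others = [v for i, v in enumerate(vectors) if i != index]
    return sum(cosine(vectors[index], v) for v in others) / len(others)


def detect_outlier(words, embeddings):
    """Return the word that is least similar, on average, to the rest."""
    vectors = [embeddings[w] for w in words]
    scores = [compactness(i, vectors) for i in range(len(vectors))]
    lowest = min(range(len(words)), key=scores.__getitem__)
    return words[lowest]


# Hypothetical toy embeddings (not taken from any real model or dataset).
toy_embeddings = {
    "paris": [0.9, 0.1, 0.0],
    "berlin": [0.8, 0.2, 0.1],
    "madrid": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

cluster = ["paris", "berlin", "madrid"]  # semantically similar entities
outlier = "banana"                       # known outlier for this cluster

predicted = detect_outlier(cluster + [outlier], toy_embeddings)
print(predicted == outlier)
```

Under this formulation, an embedding's score on a dataset would be the fraction of cluster and outlier pairs for which the predicted word matches the known outlier.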