Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics

Yu Wan, Yidong Chen, Xiaodong Shi, Changle Zhou

2018 (modified: 16 Sept 2021)Soft Comput. 2018Readers: Everyone

Abstract: Human-scored word similarity gold-standard datasets are normally composed of word pairs with corresponding similarity scores. These datasets are popular resources for evaluating word similarity models which are the essential components for many natural language processing tasks. This paper proposes a novel multidisciplinary method for constructing and validating word similarity gold-standard datasets. The proposed method is different from the previous ones in that it introduces methods from three different disciplines, i.e., psychology, brain science and computational linguistics to validate the soundness of the constructed datasets. Specifically, to the best of our knowledge, this is the first time event-related potentials experiments are incorporated to validate the word similarity datasets. Using the proposed method, we finally constructed a Chinese gold-standard word similarity dataset with 260 word pairs and showed its soundness using the interdisciplinary validating methods. It should be noted that, although the paper only focused on constructing Chinese standard dataset, the proposed method is applicable to other languages.

0 Replies