WikTDV: Data extraction and vector representation resource for Wiktionary senses

Danilo S. Carvalho, Minh-Le Nguyen

Published: 2017, Last Modified: 13 Jun 2023, KSE 2017, Readers: Everyone

Abstract: Effective use of collaborative web resources, such as Wikipedia and Wiktionary, has been a recurrent topic of research in the Natural Language Processing and Information Retrieval communities. The same can be said about the use of vector-based language representations, e.g., word, sentence, and document embeddings. However, there is currently a shortage of resources offering vector representations that take advantage of the structural properties of web resources. This paper describes a system for extracting information from Wiktionary into a machine-readable format and using this information to obtain vector representations that can be used for semantic similarity computation and basic word sense disambiguation. The methodology used to build the system is also discussed. Experimental evaluation on the semantic similarity task indicates efficiency close to that of the reference method applied in this work. A web service and visualization facilities complete the set of contributions.
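
To illustrate how sense vectors can support semantic similarity computation and basic word sense disambiguation, here is a minimal sketch. It is not the WikTDV implementation: cosine similarity is assumed as the similarity measure, and the function names, sense labels, and toy three-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context_vector, sense_vectors):
    """Return the sense label whose vector is most similar to the context vector."""
    return max(
        sense_vectors,
        key=lambda sense: cosine_similarity(context_vector, sense_vectors[sense]),
    )


# Hypothetical toy vectors; real sense vectors would come from the extracted data.
SENSE_VECTORS = {
    "bank (financial institution)": [0.9, 0.1, 0.0],
    "bank (river edge)": [0.1, 0.8, 0.3],
}

# Hypothetical vector standing in for the context in which "bank" occurs.
context = [0.8, 0.2, 0.1]
best_sense = disambiguate(context, SENSE_VECTORS)
print(best_sense)
```

In this sketch, disambiguation reduces to a nearest-neighbour lookup over the candidate senses of a word, so the quality of the result depends entirely on the quality of the sense vectors.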
