Learning Embeddings for Rare Words Leveraging Internet Search Engine and Spatial Location Relationships

Xiaotao Li, Shujuan You, Wai Chen

Published: 20 Jan 2021, Last Modified: 05 Jun 2025 | OpenReview Archive Direct Upload | Everyone | CC BY-NC-ND 4.0

Abstract: Word embedding techniques depend heavily on word frequencies in the corpus and fail to provide reliable representations for low-frequency words and for words unseen during training. To address this problem, we propose a novel algorithm that learns embeddings for rare words using an Internet search engine and spatial location relationships. Our algorithm proceeds in two steps. First, we retrieve webpages corresponding to the rare word through the search engine and parse the returned results to extract a set of the most related words. We average the vectors of these related words to obtain the initial vector of the rare word. Then, the location of the rare word in the vector space is iteratively fine-tuned according to the order of its relevance to the related words. Compared to other approaches, our algorithm can learn more accurate representations for a wider range of vocabulary. We evaluate the learned embeddings on the word relatedness task, and the experimental results show that our algorithm achieves state-of-the-art performance.
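
The two steps described in the abstract can be illustrated with a minimal sketch. It assumes the related words have already been extracted from the search results and sorted from most to least relevant, and that their embeddings are unit-length. The abstract does not specify the fine-tuning rule, so the pairwise update below (nudging the vector toward the more relevant word and away from the less relevant one whenever their similarity order is violated) is only a hypothetical stand-in, not the authors' method. The function names, `step`, and `max_iters` are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def initial_vector(related_vectors):
    """Step 1: average the related words' vectors component-wise."""
    n = len(related_vectors)
    return [sum(col) / n for col in zip(*related_vectors)]


def count_order_violations(vec, ranked_vectors):
    """Count adjacent pairs whose similarity order contradicts the ranking."""
    sims = [cosine(vec, r) for r in ranked_vectors]
    return sum(1 for i in range(len(sims) - 1) if sims[i] < sims[i + 1])


def fine_tune(vec, ranked_vectors, step=0.05, max_iters=200):
    """Step 2 (assumed rule): shift vec until similarities follow the ranking.

    ranked_vectors must be ordered from most to least relevant.
    """
    vec = list(vec)
    for _ in range(max_iters):
        changed = False
        for more, less in zip(ranked_vectors, ranked_vectors[1:]):
            if cosine(vec, more) < cosine(vec, less):
                # Move toward the more relevant word, away from the less relevant one.
                vec = [x + step * (m - l) for x, m, l in zip(vec, more, less)]
                changed = True
        if not changed:
            break  # every adjacent pair already respects the ranking
    return vec


if __name__ == "__main__":
    # Toy 2-D vectors for three related words, most relevant first.
    ranked = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    start = initial_vector(ranked)
    tuned = fine_tune(start, ranked)
    print(count_order_violations(start, ranked), count_order_violations(tuned, ranked))
```

In this sketch the averaged vector alone ignores the relevance ranking, which is why a second pass is needed; the loop stops as soon as no adjacent pair is out of order or after `max_iters` sweeps.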