Out-of-vocabulary word embedding learning based on reading comprehension mechanism

Published: 01 Jan 2023 · Last Modified: 10 Jan 2025 · Nat. Lang. Process. J. 2023 · License: CC BY-SA 4.0
Abstract: Most natural language processing tasks currently use word embeddings to represent words. However, when out-of-vocabulary (OOV) words are encountered, the performance of downstream models that take word embeddings as input is often severely limited. To address this problem, recent methods mainly infer the meaning of an OOV word from two sources of information: its morphological structure and the contexts in which it appears. However, the very low frequency of OOV words makes them difficult for general word embedding models to learn during pre-training, and the same characteristic also causes a scarcity of usable contexts for OOV embedding learning. We therefore introduce the concept of “similar contexts”, grounded in the classical distributional hypothesis in linguistics and inspired by human reading comprehension mechanisms, to compensate for the insufficient contexts in previous OOV word embedding learning work. Experimental results show that our model achieves the highest relative scores on both intrinsic and extrinsic evaluation tasks, demonstrating the positive effect of the introduced “similar contexts” on OOV word embedding learning.
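The abstract gives no implementation details, so the following Python sketch only illustrates the general idea under stated assumptions: an OOV word's embedding is inferred from its few observed contexts, augmented with “similar contexts” retrieved from a corpus by cosine similarity, in the spirit of the distributional hypothesis. Every name here (`infer_oov_embedding`, `pretrained`, the retrieval step, the simple averaging fusion) is a hypothetical stand-in, not the paper's actual model.

```python
# A minimal sketch (not the authors' implementation) of inferring an OOV
# embedding from its own scarce contexts plus retrieved "similar contexts".
# Assumption: `pretrained` maps in-vocabulary words to np.ndarray vectors,
# and contexts are tokenized lists of words.
import numpy as np

def context_vector(context, pretrained):
    """Average the pretrained embeddings of a context's known words."""
    vecs = [pretrained[w] for w in context if w in pretrained]
    return np.mean(vecs, axis=0) if vecs else None

def infer_oov_embedding(oov_contexts, corpus_contexts, pretrained, k=5):
    """Infer an OOV word's embedding under the distributional hypothesis.

    1. Encode each observed OOV context as an averaged word-vector.
    2. Retrieve the k corpus contexts most similar to the OOV contexts,
       to compensate for context scarcity ("similar contexts").
    3. Fuse original and retrieved context vectors by averaging.
    """
    own = [v for v in (context_vector(c, pretrained) for c in oov_contexts)
           if v is not None]
    if not own:
        raise ValueError("no usable context for the OOV word")
    query = np.mean(own, axis=0)

    # Score every corpus context by cosine similarity to the query.
    scored = []
    for ctx in corpus_contexts:
        v = context_vector(ctx, pretrained)
        if v is None:
            continue
        sim = v @ query / (np.linalg.norm(v) * np.linalg.norm(query) + 1e-9)
        scored.append((sim, v))
    similar = [v for _, v in sorted(scored, key=lambda x: -x[0])[:k]]

    # Simple fusion: average the scarce original contexts with the
    # retrieved similar ones; the paper presumably learns this step.
    return np.mean(own + similar, axis=0)
```

Plain averaging is the simplest possible fusion; a trained model would likely weight its own contexts, the retrieved contexts, and the OOV word's morphological features differently, but the retrieval-then-fuse structure captures why extra “similar contexts” can help when direct contexts are scarce.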
