Abstract: Languages are dynamic, and new words or variations of existing ones appear over time. Moreover, the vocabulary of distributed text representation models is limited. Therefore, methods that can handle unknown words (i.e., out-of-vocabulary, OOV) are essential for the quality of natural language processing systems. Although several techniques can handle OOV words, most rely on a single source of information (e.g., word structure or context) or on straightforward strategies unable to capture semantic or morphological information. In this study, we present FastContext, a method for handling OOV words that improves the subword-based embedding returned by the state-of-the-art FastText model with a context-based embedding computed by a deep learning model. We evaluated its performance on word similarity, named entity recognition, and part-of-speech tagging tasks. FastContext outperformed FastText in scenarios where context is the most relevant source for inferring the meaning of OOV words. Moreover, the proposed approach obtained better results than state-of-the-art OOV handling techniques such as HiCE, Comick, and DistilBERT.