An embedding method for unseen words considering contextual information and morphological information
Abstract: The performance of natural language processing has been greatly improved by pre-trained language models trained on large corpora. However, performance can still be degraded by the OOV (Out of Vocabulary) problem. Recent language representation models such as BERT use sub-word tokenization, which splits words into pieces, to deal with the OOV problem. However, since OOV words are also divided into token pieces and thus represented as a weighted sum of unusual tokens, this can lead to misrepresentation of the OOV words. To relax this misrepresentation problem, we propose a character-level pre-trained language model called CCTE (Context Char Transformer Encoder). Unlike BERT, CCTE takes the entire word as input and represents it by considering both morphological and contextual information. Experiments on multiple datasets for NER and POS tagging show that the proposed model, although smaller than existing pre-trained models, generally outperforms them. In particular, when more OOV words are present, the proposed method outperforms by a large margin. In addition, cosine similarity comparisons of word pairs show that the proposed method properly captures the morphological and contextual information of words.
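To illustrate the sub-word splitting the abstract refers to, the sketch below shows how a WordPiece tokenizer such as BERT's breaks an unseen word into several rare pieces while keeping an in-vocabulary word whole. The model name and example words are assumptions for illustration only and are not taken from the paper.

    # Minimal sketch (not from the paper): WordPiece splitting of an unseen word.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # An in-vocabulary word is kept as a single token.
    print(tokenizer.tokenize("running"))
    # An out-of-vocabulary word is broken into sub-word pieces,
    # e.g. something like ['crystal', '##lo', '##man', '##cy'].
    print(tokenizer.tokenize("crystallomancy"))

Because the unseen word ends up represented only through such pieces, its embedding may not reflect the word's actual meaning, which is the misrepresentation problem the proposed character-level model aims to relax.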