Keywords: Vocabulary Expansion, Definition-Learning
Abstract: Embedding models have become a core component of modern Natural Language Processing (NLP).
However, most widely used embedding models adopt tokenizers that rely on frequency- or probability-based segmentation, which overlooks the morphological structure of agglutinative languages such as Korean.
Moreover, embedding model vocabulary is fixed after training and cannot be extended to incorporate new domain-specific words.
To address these challenges, we propose an embedding framework that integrates morpheme-aware tokenization with definition-based vocabulary expansion and is readily extensible to other languages and backbone models.
We also introduce a SLERP-based training objective that improves embedding alignment during dimensionality reduction.
After training, the framework enables vocabulary expansion by generating token embeddings from definition sentences without additional fine-tuning.
Experimental results show that our approach improves downstream task performance by 0.1710–0.2182 in terms of Micro-F1 and Micro-Recall@K.
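The abstract does not spell out the SLERP-based objective, but the underlying operation, spherical linear interpolation between two embedding vectors, is standard. As a rough sketch (NumPy, with a hypothetical `slerp` helper not taken from the paper):

```python
import numpy as np

def slerp(u: np.ndarray, v: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between embedding vectors u and v.

    Both inputs are projected onto the unit sphere first; t=0 returns the
    direction of u, t=1 the direction of v, and intermediate t values move
    along the great-circle arc between them.
    """
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    dot = np.clip(np.dot(u, v), -1.0, 1.0)
    omega = np.arccos(dot)            # angle between the two directions
    if omega < 1e-6:                  # nearly parallel: fall back to lerp
        return (1.0 - t) * u + t * v
    sin_omega = np.sin(omega)
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / sin_omega
```

Unlike linear interpolation, the result stays on the unit sphere, which is why SLERP is a natural fit when aligning normalized embeddings across a dimensionality-reduction step.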
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, continual learning, transfer, morphological segmentation, subword representations
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: Korean
Submission Number: 2192