Post-hoc Vocabulary Expansion for Embedding Models via Definition-Learning

ACL ARR 2026 January Submission 2192 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Vocabulary Expansion, Definition-Learning
Abstract: Embedding models have become a core component of modern Natural Language Processing (NLP). However, most widely used embedding models adopt tokenizers that rely on frequency- or probability-based segmentation, which overlooks the morphological structure of agglutinative languages such as Korean. Moreover, an embedding model's vocabulary is fixed after training and cannot be extended to incorporate new domain-specific words. To address these challenges, we propose an embedding framework that integrates morpheme-aware tokenization with definition-based vocabulary expansion and is readily extensible to other languages and backbone models. We also introduce a SLERP-based training objective that improves embedding alignment during dimensionality reduction. After training, the framework enables vocabulary expansion by generating token embeddings from definition sentences, without additional fine-tuning. Experimental results show that our approach improves downstream task performance by $0.1710$ to $0.2182$ in Micro-F1 and Micro-Recall@K.
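The SLERP (spherical linear interpolation) objective mentioned in the abstract interpolates between embeddings along the unit hypersphere rather than along a straight line, which preserves vector norms. A minimal sketch of the underlying operation, assuming unit-normalized embeddings (the function name, normalization choices, and fallback behavior are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def slerp(u: np.ndarray, v: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between embeddings u and v at ratio t in [0, 1]."""
    # Normalize so both vectors lie on the unit hypersphere.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Angle between the two unit vectors.
    dot = np.clip(np.dot(u, v), -1.0, 1.0)
    omega = np.arccos(dot)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * u + t * v
    sin_omega = np.sin(omega)
    # Great-circle interpolation; the result stays on the unit sphere.
    return (np.sin((1.0 - t) * omega) / sin_omega) * u \
         + (np.sin(t * omega) / sin_omega) * v
```

Unlike plain linear interpolation, the interpolated vector keeps unit norm for any `t`, which is the property that makes SLERP attractive when aligning embeddings across a dimensionality-reduction step.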
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, continual learning, transfer, morphological segmentation, subword representations
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: Korean
Submission Number: 2192