Keywords: lexical gap, multilingual large language models, embeddings, emotion, Korean, English
TL;DR: We propose a language-agnostic and taxonomy-free method for scalable cross-lingual lexical gap identification using embeddings from multilingual LLMs.
Abstract: Lexical gaps are words in one language that have no lexical equivalent in another. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extract contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yield 4000 distinct embedding spaces in each source language. In each space, we compute the semantic similarity between each source word and its nearest neighbor in the target language, and compare the distribution of these similarities for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words show weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
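The nearest-neighbor comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `nearest_neighbor_similarity` and `auc_from_scores` are hypothetical, the vectors are synthetic toy data rather than LLM embeddings, cosine similarity is assumed as the similarity measure, and a rank-based AUC on the raw similarity score stands in for the logistic classifiers used in the paper.

```python
import numpy as np


def nearest_neighbor_similarity(source_vecs, target_vecs):
    """Cosine similarity of each source vector to its nearest target vector."""
    src = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    tgt = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    # Rows: source words, columns: target words; keep the best match per row.
    return (src @ tgt.T).max(axis=1)


def auc_from_scores(scores, is_gap):
    """Probability that a random non-gap word has a higher score than a
    random gap word (ties count half). This is a rank-based AUC, used here
    as a simplification in place of a trained classifier."""
    gap = scores[is_gap]
    non_gap = scores[~is_gap]
    diffs = non_gap[:, None] - gap[None, :]
    return float(((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size)


# Synthetic toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
target_vecs = rng.normal(size=(50, 16))
# "Non-gap" source words sit close to some target word.
non_gap_vecs = target_vecs[:20] + 0.1 * rng.normal(size=(20, 16))
# "Gap" source words have no close counterpart in the target space.
gap_vecs = rng.normal(size=(5, 16))

source_vecs = np.vstack([non_gap_vecs, gap_vecs])
is_gap = np.array([False] * 20 + [True] * 5)

sims = nearest_neighbor_similarity(source_vecs, target_vecs)
auc = auc_from_scores(sims, is_gap)
print("mean similarity, non-gap:", sims[~is_gap].mean())
print("mean similarity, gap:", sims[is_gap].mean())
print("AUC:", auc)
```

In this toy setting, gap words are expected to show lower nearest-neighbor similarity than non-gap words, which mirrors the direction of the effect reported in the abstract without reproducing its numbers.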
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Computational Psycholinguistics, Cognition and Linguistics
Secondary Area Selection: Typology and Multilinguality
Use Of Generative Artificial Intelligence Tools: Yes, other (specify below)
Other Use Of Generative Artificial Intelligence Tools: coding assistance, formatting of LaTeX tables and figures, grammar check
Data Collection From Human Subjects: No
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 189