An Experimental Study of Automating Explanatory Dictionary Compilation with Language Models

Timur Garipov, Dmitry Morozov, Yana Gubarkova, Anastasia Kozerenko, Anna Glazkova

Published: 01 Jan 2026, Last Modified: 05 Jan 2026, Crossref, CC BY-SA 4.0
Abstract: The creation of explanatory dictionaries has long been a cornerstone of classical linguistics. For the Russian language alone, dozens of such dictionaries have been compiled. However, the process of dictionary compilation is highly labor-intensive, requiring significant time and expertise. Moreover, the emergence of new words and the evolution of existing ones necessitate continuous updates to keep dictionaries relevant. Existing dictionaries are not without flaws; for instance, it is not uncommon for definitions to include terms that are more complex than the word being defined. Meanwhile, generative language models have reached a level of sophistication that allows them to tackle a wide range of applied tasks with near-expert proficiency. In this study, we explored whether the task of compiling a modern explanatory dictionary can be addressed using machine learning. We focused on two specific subtasks: 1) generating definitions for words not yet included in dictionaries, and 2) producing generalized definitions based on multiple existing dictionaries. We conducted a series of experiments that involved both fine-tuning and prompt-based approaches with language models. The quality of the generated definitions was evaluated using both automated metrics and human assessment. Our results demonstrate that while traditional sequence-to-sequence models like T5 and BART struggle with producing clear and accurate definitions, large language models (LLMs) yield significantly better results. At the same time, generating definitions from scratch works noticeably worse than generalizing existing ones. Our findings highlight the potential of LLM-based methods for automating dictionary compilation and suggest promising directions for further research in AI-assisted lexicography.