From Lexical Entries to Corpus Queries: Retrieving Multiword Expressions in Serbian

Published: 27 May 2026, Last Modified: 27 May 2026, UniDive 2026, Everyone, CC BY-SA 4.0
Keywords: lexicon, corpus, queries, MWE, CQL
Working Group: WG2: Lexicon-corpus interface
Abstract: Multiword expressions (MWEs) pose significant challenges for corpus retrieval due to their structural variability, morphosyntactic flexibility, and inconsistent representation in lexicographic resources. This paper proposes a methodology for transforming lexicographic encodings of MWEs into effective and linguistically informed corpus queries. The study is motivated by the development of the Dictionary of the Modern Serbian Language and draws on data from the Leximirka lexical database, which contains over 250,000 entries, including approximately 23,000 MWEs. The proposed approach is based on a set of manually designed rules that systematically convert canonical lexicographic representations into Corpus Query Language (CQL) patterns. These queries are designed to capture key linguistic phenomena such as free word order, discontinuous expressions, optional elements, and morphosyntactic variation, while also addressing inconsistencies in part-of-speech tagging. Special attention is given to the role of dictionary metalanguage, particularly the use of brackets and alternative forms, which complicate direct transformation into query structures. The methodology prioritizes recall in order to ensure comprehensive retrieval of valid corpus attestations, with the assumption that precision can be improved through subsequent filtering. The approach is evaluated using Serbian corpora (SrpKor) and additional phraseological data from multiple dictionary sources. The results demonstrate that systematic transformation of MWEs into flexible query patterns significantly improves the integration of lexical resources with corpus data. This work contributes to computational lexicography and phraseology by providing a practical framework for linking structured lexical knowledge with empirical corpus evidence, enabling more robust and scalable analysis of MWEs.
WG2 Tasks: Task 2.2: Design of a lexicon-corpus interface
Tracks For Type Of Contribution: Work in progress
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 34
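
The rule-based transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' rule set: the input notation (parentheses for an optional element, a slash between alternative forms), the function names `token_to_cql` and `mwe_to_cql`, the `lemma` attribute name, and the `max_gap` and `free_order` parameters are all assumptions made for illustration. The sketch assumes the canonical form is already given as a whitespace-separated sequence of lemmas and emits a pattern in the CQL syntax used by CQP-style corpus engines.

```python
import re
from itertools import permutations


def token_to_cql(token: str) -> str:
    """Turn one element of a canonical form into a CQL token pattern.

    Assumed (hypothetical) notation: "(x)" marks an optional element,
    "a/b" lists alternative lemmas for the same slot.
    """
    optional = token.startswith("(") and token.endswith(")")
    if optional:
        token = token[1:-1]
    # Escape regex metacharacters so each lemma is matched literally.
    alternatives = "|".join(re.escape(alt) for alt in token.split("/"))
    pattern = '[lemma="' + alternatives + '"]'
    return pattern + "?" if optional else pattern


def mwe_to_cql(canonical: str, max_gap: int = 2, free_order: bool = False) -> str:
    """Build a recall-oriented CQL query from a canonical lemma sequence.

    max_gap    -- number of arbitrary tokens allowed between components,
                  which lets the query match discontinuous occurrences.
    free_order -- if True, emit an alternation over every ordering of the
                  components (factorial growth, so only for short MWEs).
    """
    tokens = [token_to_cql(t) for t in canonical.split()]
    gap = f" []{{0,{max_gap}}} " if max_gap > 0 else " "
    if not free_order:
        return gap.join(tokens)
    orders = [gap.join(order) for order in permutations(tokens)]
    return "(" + ") | (".join(orders) + ")"


# Illustrative input written as a lemma sequence; not taken from Leximirka.
print(mwe_to_cql("praviti od komarac magarac", max_gap=2))
```

Allowing gaps and optional slots by default reflects the recall-first strategy stated in the abstract: the query over-generates, and false positives are left to a later filtering step.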