Linking MWE occurrences in corpora with their sense inventory entries using Linguistic Linked Data technology

Ranka Stanković; Verginica Mititelu; Jan Odijk; Carole Tiberius; Voula Giouli; Milica Ikonić Nešić

Published: 27 May 2026, Last Modified: 27 May 2026. Venue: UniDive 2026. License: CC BY-SA 4.0

Keywords: lexicon, corpus, linked data, nif, ontolex-lemon

Working Group: WG2: Lexicon-corpus interface

Abstract: This work addresses the gap between lexical resources and corpora in the representation and analysis of multiword expressions (MWEs). Building on the UniDive COST Action, it proposes a Linguistic Linked Open Data (LLOD) framework that integrates lexicons, annotated parallel corpora, and knowledge graphs. MWEs are modeled using OntoLex-Lemon, while corpus attestations are encoded in the NLP Interchange Format (NIF) and CoNLL-RDF, enabling precise linking between lexical entries and textual occurrences. Cross-lingual relations are captured through standardized vocabularies, and all resources are deployed in a GraphDB environment for SPARQL-based querying. The approach is demonstrated on the ELEXIS-WSD dataset, showing how integrated infrastructures support multilingual exploration, interoperability, and advanced analysis of MWEs across lexicons and corpora.
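The linking idea in the abstract can be illustrated with a minimal sketch in plain Python, with triples held as tuples rather than in a real triple store. The OntoLex-Lemon and NIF terms used here (ontolex:MultiwordExpression, ontolex:sense, nif:Context, nif:String, nif:anchorOf, nif:beginIndex, nif:endIndex, nif:referenceContext) are standard vocabulary. Everything else is an assumption made for illustration: the example.org IRIs, the sample sentence (not taken from ELEXIS-WSD), and in particular the linking property ex:attestsSense, which is hypothetical because the abstract does not name the property the authors use.

```python
# Standard namespaces, plus a hypothetical ex: namespace for illustration.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # assumption: placeholder base IRI

# Illustrative sentence and MWE surface form (not from the paper's data).
context_text = "He finally kicked the bucket."
mwe_surface = "kicked the bucket"
begin = context_text.index(mwe_surface)   # zero-based, inclusive
end = begin + len(mwe_surface)            # exclusive, as in NIF offsets

# Lexicon side: an MWE entry with one sense (OntoLex-Lemon).
entry = EX + "lexicon/kick_the_bucket"
sense = entry + "#sense1"

# Corpus side: a NIF context and a substring identified by character offsets.
context = EX + f"corpus/sent1#char=0,{len(context_text)}"
occurrence = EX + f"corpus/sent1#char={begin},{end}"

triples = {
    (entry, RDF_TYPE, ONTOLEX + "MultiwordExpression"),
    (entry, ONTOLEX + "sense", sense),
    (context, RDF_TYPE, NIF + "Context"),
    (context, NIF + "isString", context_text),
    (occurrence, RDF_TYPE, NIF + "String"),
    (occurrence, NIF + "referenceContext", context),
    (occurrence, NIF + "anchorOf", mwe_surface),
    (occurrence, NIF + "beginIndex", begin),
    (occurrence, NIF + "endIndex", end),
    # Hypothetical property linking the occurrence to its sense entry.
    (occurrence, EX + "attestsSense", sense),
}


def occurrences_of(entry_iri, graph):
    """Return (occurrence IRI, anchor text) pairs attesting any sense of the entry."""
    senses = {o for s, p, o in graph
              if s == entry_iri and p == ONTOLEX + "sense"}
    hits = []
    for s, p, o in graph:
        if p == EX + "attestsSense" and o in senses:
            anchor = next(o2 for s2, p2, o2 in graph
                          if s2 == s and p2 == NIF + "anchorOf")
            hits.append((s, anchor))
    return sorted(hits)


for iri, anchor in occurrences_of(entry, triples):
    print(iri, "->", anchor)
```

The lookup function plays the role that a SPARQL query would play in a GraphDB deployment: it starts from a lexical entry, follows its senses, and returns the corpus substrings linked to them.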

WG2 Tasks: Task 2.2: Design of a lexicon-corpus interface, Task 2.3: Proof-of-concept lexicon encoding of MWEs

Tracks For Type Of Contribution: Work in progress

Submission Number: 36
