Linking MWE occurrences in corpora with their sense inventory entries using Linguistic Linked Data technology

Published: 27 May 2026, Last Modified: 27 May 2026
Venue: UniDive 2026
License: CC BY-SA 4.0
Keywords: lexicon, corpus, linked data, nif, ontolex-lemon
Working Group: WG2: Lexicon-corpus interface
Abstract: This work addresses the gap between lexical resources and corpora in the representation and analysis of multiword expressions (MWEs). Building on the UniDive COST Action, it proposes a Linguistic Linked Open Data (LLOD) framework that integrates lexicons, annotated parallel corpora, and knowledge graphs. MWEs are modeled using OntoLex-Lemon, while corpus attestations are encoded in the NLP Interchange Format (NIF) and CoNLL-RDF, enabling precise linking between lexical entries and textual occurrences. Cross-lingual relations are captured through standardized vocabularies, and all resources are deployed in a GraphDB environment for SPARQL-based querying. The approach is demonstrated on the ELEXIS-WSD dataset, showing how integrated infrastructures support multilingual exploration, interoperability, and advanced analysis of MWEs across lexicons and corpora.
WG2 Tasks: Task 2.2: Design of a lexicon-corpus interface, Task 2.3: Proof-of-concept lexicon encoding of MWEs
Tracks For Type Of Contribution: Work in progress
Submission Number: 36
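
Illustrative sketch: The abstract describes linking OntoLex-Lemon lexical entries to NIF-encoded corpus occurrences and then querying across both. The Python sketch below illustrates that idea in miniature, under several assumptions: triples are held in a plain in-memory list rather than in GraphDB, a small pattern-matching helper stands in for SPARQL, and the `ex:` identifiers, the example sentence, and the linking property `ex:attestsSense` are hypothetical placeholders, not names taken from the submission. The `ontolex:` and `nif:` terms are existing vocabulary terms, but the way they are combined here is only one plausible modeling, not necessarily the authors'.

```python
# Minimal in-memory sketch: triples are (subject, predicate, object) tuples.
# IRIs are written as prefixed-name strings; literal values are plain str/int.
# All ex: names, including ex:attestsSense, are hypothetical placeholders.

TEXT = "He will kick the bucket soon."
MWE = "kick the bucket"


def build_graph(text, mwe):
    """Return triples for one lexicon entry and one corpus occurrence of it."""
    begin = text.find(mwe)
    if begin < 0:
        raise ValueError("expression not found in text")
    end = begin + len(mwe)

    entry = "ex:kick_the_bucket"
    form = "ex:kick_the_bucket_form"
    sense = "ex:kick_the_bucket_sense1"
    context = "ex:doc1_context"
    occurrence = "ex:doc1_char_%d_%d" % (begin, end)

    return [
        # Lexicon side (OntoLex-Lemon): entry, canonical form, sense.
        (entry, "rdf:type", "ontolex:MultiwordExpression"),
        (entry, "ontolex:canonicalForm", form),
        (form, "ontolex:writtenRep", mwe),
        (entry, "ontolex:sense", sense),
        (sense, "rdf:type", "ontolex:LexicalSense"),
        # Corpus side (NIF): the full text and the character span of the MWE.
        (context, "rdf:type", "nif:Context"),
        (context, "nif:isString", text),
        (occurrence, "rdf:type", "nif:Phrase"),
        (occurrence, "nif:referenceContext", context),
        (occurrence, "nif:beginIndex", begin),
        (occurrence, "nif:endIndex", end),
        (occurrence, "nif:anchorOf", text[begin:end]),
        # Link from the occurrence to the sense (assumed property).
        (occurrence, "ex:attestsSense", sense),
    ]


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def occurrences_of(graph, entry):
    """Follow entry -> sense -> occurrence and return (anchor, begin, end)."""
    results = []
    for _, _, sense in match(graph, s=entry, p="ontolex:sense"):
        for occ, _, _ in match(graph, p="ex:attestsSense", o=sense):
            anchor = match(graph, s=occ, p="nif:anchorOf")[0][2]
            begin = match(graph, s=occ, p="nif:beginIndex")[0][2]
            end = match(graph, s=occ, p="nif:endIndex")[0][2]
            results.append((anchor, begin, end))
    return results


if __name__ == "__main__":
    graph = build_graph(TEXT, MWE)
    for anchor, begin, end in occurrences_of(graph, "ex:kick_the_bucket"):
        print(anchor, begin, end)
```

In a deployment like the one the abstract outlines, the same entry-to-sense-to-occurrence traversal would presumably be written as a SPARQL query against the triple store; the list-based helper here only makes the join path explicit. The sketch also covers only a contiguous expression, so discontinuous MWEs would need a richer span representation.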