Combining and Extending Data Infrastructures with Linguistic Annotation Services

Published: 01 Jan 2015, Last Modified: 27 May 2025 | WLSI 2015 | CC BY-SA 4.0
Abstract: This paper reports on a first prototype implementation that combines and extends a data infrastructure with linguistic processing services. It brings language datasets and basic language processing services together in a unified platform, thereby boosting the organic growth of data and facilitating language technology research and development. The META-SHARE data infrastructure is enhanced with a language processing mechanism for annotating content with NLP services that are themselves documented with appropriate metadata. Atomic services are combined into workflows modeled as directed acyclic graphs, where each node corresponds to an NLP processing service (e.g. sentence splitting, part-of-speech tagging). Services run either locally or remotely. Currently, the language processing layer implements services and workflows for processing monolingual and bilingual content and resources in raw text, XCES, and TMX formats. As for the legal framework, a simple operational model is adopted under which only openly licensed datasets can be processed by openly licensed services and workflows.
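
The workflow model described in the abstract can be illustrated with a minimal sketch. It is not the META-SHARE implementation: the service functions, the `WORKFLOW` graph, the `OPEN_LICENSES` set, and `run_workflow` are all hypothetical names chosen for illustration, and the part-of-speech tagger is a placeholder. The sketch assumes each atomic service takes a document dictionary and returns it with one more annotation layer, that the graph is given as a mapping from each node to its prerequisites, and that openness is checked against a fixed list of license identifiers.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical atomic services: each adds one annotation layer to the document.
def split_sentences(doc):
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    return {**doc, "sentences": [p.strip() for p in parts if p.strip()]}

def tokenize(doc):
    tokens = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return {**doc, "tokens": tokens}

def tag_pos(doc):
    # Placeholder tagger (assumption): only separates words from punctuation.
    tags = [["WORD" if t[0].isalnum() else "PUNCT" for t in sent]
            for sent in doc["tokens"]]
    return {**doc, "pos": tags}

SERVICES = {
    "sentence_splitter": split_sentences,
    "tokenizer": tokenize,
    "pos_tagger": tag_pos,
}

# Workflow as a directed acyclic graph: node -> set of nodes it depends on.
WORKFLOW = {
    "sentence_splitter": set(),
    "tokenizer": {"sentence_splitter"},
    "pos_tagger": {"tokenizer"},
}

# Illustrative list only; which licenses count as open is an assumption here.
OPEN_LICENSES = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}

def run_workflow(doc, dataset_license, workflow_license):
    """Run every service in dependency order if both licenses are open."""
    if dataset_license not in OPEN_LICENSES or workflow_license not in OPEN_LICENSES:
        raise PermissionError("dataset and workflow must both be openly licensed")
    # static_order() yields prerequisites first and raises CycleError on a cycle.
    for name in TopologicalSorter(WORKFLOW).static_order():
        doc = SERVICES[name](doc)
    return doc

if __name__ == "__main__":
    result = run_workflow({"text": "Hello world. How are you?"},
                          "CC-BY-SA-4.0", "CC0-1.0")
    print(result["sentences"])
    print(result["pos"])
```

Ordering the nodes topologically means a service never runs before the annotations it depends on exist, and a cyclic workflow definition is rejected before any service is called. A remote service could sit behind the same function interface, although how the prototype dispatches local and remote calls is not specified in the abstract.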