Nested Text Labelling Structures to Organize Knowledge in AI Applications for the Humanities and Social Sciences

19 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Generalized text markup, Natural Language Processing, Text annotation, Data collection, Automated content analysis, Multi-fragments, Multiple assessors, Machine learning, Artificial intelligence, Language modeling, Humanities knowledge, Semantic analysis, Named entity recognition, Relation extraction, Co-reference resolution
TL;DR: In the humanities and social sciences, there is no single simple answer as to which text annotation model is always best, so we propose a general-purpose, multi-level system of data models for text annotation.
Abstract: A body of scientific literature has emerged that advances annotation frameworks incorporating multi-fragment and multi-assessor labelling protocols alongside contextual data. Rigorously defined, expert-driven text annotations of this kind provide a foundation for developing machine learning models capable of automatic text markup. This paper aims to identify knowledge representations that suit both human annotators and machine learning processes across various task types. Experience gained through a number of applied projects and research studies has shown that there is no simple answer. We therefore propose a multi-level approach to the data models used for text annotation. Because the approach applies to tasks involving context, multi-assessor labelling, and the extraction of subjective textual categories, the paper sets out its conceptual and logical foundations together with the associated cases. The proposed framework comprises three nested data models, each distinguished by its level of complexity. The relational representation of textual annotations is flexible enough to cover a variety of annotation scenarios. It supports named entity recognition, relation extraction, semantic analysis, co-reference resolution, frame semantics, multi-span matching, and other tasks: at least 17 task types whose inputs and outputs differ fundamentally in structural complexity. The framework includes a core model, an extended set of entities, and their relations. The same dataset can serve tasks of significantly different types. The broad applicability of our framework is supported by a survey of 21 datasets and related tasks found in more than a thousand publications. The proposed methodology extends the scope of structured text annotation, advances the standardisation of content analysis procedures, and facilitates solutions for a broader spectrum of natural language processing tasks.
Primary Area: datasets and benchmarks
Submission Number: 21477
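
Illustrative note: the abstract describes a relational representation that supports multi-fragment and multi-assessor labelling. The Python sketch below shows one way such a representation might look. It is not the authors' data model: the class names (`Fragment`, `Annotation`), the field names, and the example sentence and labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fragment:
    """A contiguous span of a document (hypothetical schema)."""
    fragment_id: int
    doc_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass(frozen=True)
class Annotation:
    """One label assigned by one assessor to one or more fragments."""
    annotation_id: int
    assessor: str
    label: str
    fragment_ids: Tuple[int, ...]  # several ids = multi-fragment annotation


def make_fragment(fragment_id: int, doc_id: str, text: str, phrase: str) -> Fragment:
    """Build a fragment from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Fragment(fragment_id, doc_id, start, start + len(phrase))


def annotation_texts(ann: Annotation, fragments: Dict[int, Fragment], text: str) -> List[str]:
    """Return the text of every fragment covered by an annotation."""
    return [text[fragments[i].start:fragments[i].end] for i in ann.fragment_ids]


def labels_by_assessor(annotations: List[Annotation]) -> Dict[str, List[str]]:
    """Group labels per assessor, keeping each assessor's view separate."""
    grouped: Dict[str, List[str]] = {}
    for ann in annotations:
        grouped.setdefault(ann.assessor, []).append(ann.label)
    return grouped


# Hypothetical example document and annotations.
TEXT = "Marie Curie, who was born in Warsaw, won two Nobel Prizes."
FRAGMENTS = {
    1: make_fragment(1, "doc-1", TEXT, "Marie Curie"),
    2: make_fragment(2, "doc-1", TEXT, "born in Warsaw"),
    3: make_fragment(3, "doc-1", TEXT, "won two Nobel Prizes"),
}
ANNOTATIONS = [
    # Assessor A links two non-adjacent fragments under one label.
    Annotation(1, "assessor_a", "achievement", (1, 3)),
    # Assessor B marks only the predicate for the same label.
    Annotation(2, "assessor_b", "achievement", (3,)),
    Annotation(3, "assessor_b", "birthplace", (1, 2)),
]
```

Keeping fragments and annotations in separate records, linked by identifiers, lets several assessors label the same or overlapping spans without overwriting each other, and lets one annotation cover discontiguous text.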