Nested Text Labelling Structures to Organize Knowledge in AI Applications for the Humanities and Social Sciences

ICLR 2026 Conference Submission21477 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Generalized text markup, Natural Language Processing, Text annotation, Data collection, Automated content analysis, Multi-fragments, Multiple assessors, Machine learning, Artificial intelligence, Language modeling, Humanities knowledge, Semantic analysis, Named entity recognition, Relation extraction, Co-reference resolution
TL;DR: In the humanities and social sciences, there is no single simple answer to the question of which text annotation model is best, so we propose a general-purpose, multi-level system of data models for text annotation.
Abstract: Recent research in the humanities and social sciences indicates that prevailing text annotation models do not always convey the nuances of expert knowledge effectively or fully, which in turn hinders advanced applications of AI. This paper aims to identify the most suitable data models for both human annotators and machine learning processes; the key issue is the trade-off between the convenience and the expressiveness of the annotation model. Experience from a number of applied projects and research studies has shown that there is no single simple answer, which is why we propose a multi-level approach to data models for text annotation. This article sets out the conceptual and logical foundations of that approach, along with the associated tasks. The proposed framework comprises three nested data models, each distinguished by its level of complexity. Built on a relational representation of textual annotations, it offers the flexibility required for a variety of annotation scenarios. It supports named entity recognition, relation extraction, semantic analysis, co-reference resolution, frame semantics, multi-span matching, and other tasks: at least 17 task types whose inputs and outputs differ fundamentally in structural complexity. The framework includes a core model, an extended set of entities, and their relations. The same dataset can be related to various tasks, even tasks of significantly different types. The framework can handle multiple annotations, multi-span elements, optional tags, and contextual metadata. Its broad applicability is supported by a survey of 21 datasets and related tasks found in more than a thousand publications. The proposed approach broadens the horizons of structured text annotation, promoting the standardization of content analysis methodologies and enabling solutions to a diverse range of natural language processing tasks.
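Illustrative sketch (not taken from the submission): the abstract does not specify the three data models, so the Python below only shows one way a relational representation of annotations could accommodate multi-span elements, optional tags, multiple annotators, and contextual metadata. Every name here (`Span`, `Element`, `Relation`, `span_of`, the tag and relation labels) is a hypothetical choice made for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """A contiguous fragment of the text (hypothetical structure)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class Element:
    """An annotated element; more than one span makes it multi-span."""
    element_id: int
    spans: List[Span]
    annotator: str
    tag: Optional[str] = None  # the tag is optional
    metadata: Dict[str, str] = field(default_factory=dict)

    def text(self, document: str) -> str:
        # Join the fragments in order, separated by a single space.
        return " ".join(document[s.start:s.end] for s in self.spans)


@dataclass
class Relation:
    """A typed, directed link between two elements, stored by id."""
    relation_id: int
    kind: str
    source_id: int
    target_id: int
    annotator: str


def span_of(document: str, phrase: str) -> Span:
    """Locate the first occurrence of a phrase and return its span."""
    start = document.index(phrase)
    return Span(start, start + len(phrase))


document = "Marie Curie, who was born in Warsaw, won two Nobel Prizes."

# Two assessors label the same fragment differently; both rows are kept.
place_a = Element(1, [span_of(document, "Warsaw")], "assessor_a", tag="LOCATION")
place_b = Element(2, [span_of(document, "Warsaw")], "assessor_b", tag="CITY")

# A discontinuous (multi-span) element that skips the relative clause.
event = Element(
    3,
    [span_of(document, "Marie Curie"), span_of(document, "won two Nobel Prizes")],
    "assessor_a",
    tag="AWARD_EVENT",
    metadata={"source": "example sentence"},
)

person = Element(4, [span_of(document, "Marie Curie")], "assessor_a", tag="PERSON")
born_in = Relation(1, "born_in", source_id=person.element_id,
                   target_id=place_a.element_id, annotator="assessor_a")

print(event.text(document))
```

Because elements and relations are flat records linked by identifiers, the same rows could in principle be filtered into inputs for different tasks, for example tagged single-span elements for named entity recognition and `Relation` rows for relation extraction.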
Primary Area: datasets and benchmarks
Submission Number: 21477