Decomposing Unitization and Typing for Efficient and Consistent Span-Bound Concept Annotation

ACL ARR 2026 January Submission 6811 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Data-Efficiency, Annotation, Named Entity Recognition
Abstract: In specialized domains that require expert annotators and high inter-annotator agreement, high-quality datasets with span-bound semantic concept annotations remain expensive to develop. Substantial resources are typically spent on $\textit{unitizing}$, the task of identifying precise span boundaries for entity mentions. Unitizing is a significant source of inter-annotator disagreement, a poor use of expensive domain expertise, and very time-consuming. We propose a lighter annotation procedure that concentrates manual efforts on typed position annotations, marking positions in the text that overlap with mentions of each entity type, abstracting away span boundary decisions. With as few as 100-200 example sentences, we train span boundary detection models to unitize typed position annotations. Through evaluation over three datasets: CRAFT (biomedical), GENIA (molecular biology), and POLIANNA (climate/energy policy text), we demonstrate that (1) annotating typed positions in the text instead of full concept annotation is a more efficient use of time in low-resource settings, and (2) model-inferred span boundaries result in higher agreement at both the annotator training and corpus annotation phases, without sacrificing utility.
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Annotation in Resource-Constrained Settings, Span-bound Annotation, Named Entity Recognition
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 6811