A Dataset for Analysing Complex Document Layouts in the Digital Humanities and Its Evaluation with Krippendorff's Alpha

Published: 01 Jan 2022, Last Modified: 06 Nov 2023, Venue: GCPR 2022, Readers: Everyone
Abstract: We introduce a new research resource in the form of a high-quality, domain-specific dataset for analysing the document layout of historical documents. The dataset provides an instance segmentation ground truth with 19 classes based on historical layout structures that stem (a) from the publication production process and the respective genres (life sciences, architecture, art, decorative arts, etc.) and (b) from selected text registers (such as monograph, trade journal, illustrated magazine). Altogether, the dataset contains more than 52,000 instances annotated by experts. A baseline has been tested with the well-known Mask R-CNN and compared to the state-of-the-art model VSR [55]. Inspired by evaluation practices from the field of Natural Language Processing (NLP), we have developed a new method for evaluating annotation consistency. Our method is based on Krippendorff's alpha (K-$\alpha$), a statistic for quantifying the so-called "inter-annotator agreement". In particular, we propose an adaptation of K-$\alpha$ that treats annotations as a multipartite graph for assessing the agreement of a variable number of annotators. The method is adjustable with regard to evaluation strictness, and it can be used in 2D or 3D as well as for a variety of tasks such as semantic segmentation, instance segmentation, and 3D point cloud segmentation.
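For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the multipartite-graph adaptation proposed in the paper: it assumes the annotations have already been matched into units (for example, one unit per region that the annotators agree refers to the same instance), and the function name `krippendorff_alpha_nominal` and the input layout are illustrative assumptions rather than anything taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Textbook Krippendorff's alpha for nominal data (hypothetical helper).

    units: list of lists; each inner list holds the class labels that the
    annotators assigned to one unit. The number of annotators may vary per
    unit; missing annotations are simply left out.
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in that unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (nominal metric: 0 if equal, else 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        # Only one label occurs at all, so alpha is undefined;
        # returning NaN here is a convention chosen for this sketch.
        return float("nan")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Three units, two annotators each: one disagreement, two agreements.
    example = [["text", "image"], ["text", "text"], ["image", "image"]]
    print(krippendorff_alpha_nominal(example))
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement. How units are formed from overlapping regions, and how strictly overlaps are judged, is exactly the part that the paper's adaptation addresses and that this sketch leaves out.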