Chronicling Germany: An Annotated Historical Newspaper Dataset

23 May 2024 (modified: 13 Nov 2024). Submitted to NeurIPS 2024 Track Datasets and Benchmarks. License: CC BY 4.0
Keywords: historic newspaper processing, digital history, computer vision
Abstract: The correct detection of article layout in historical newspaper pages remains challenging but is important for Natural Language Processing (NLP) and machine learning applications in the field of digital history. Digital newspaper portals typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, limiting this rich source’s scope. Our dataset is designed to address this issue for historic German-language newspapers. The Chronicling Germany dataset contains 581 annotated historical newspaper pages from the period between 1852 and 1924. Historic domain experts have spent more than 1,500 hours annotating the dataset. The paper presents a processing pipeline and establishes baseline results on in- and out-of-domain test data using this pipeline. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in the field of digital history and historic German-language newspaper processing. Furthermore, it provides the opportunity to study a low-resource task in computer vision.
Flagged For Ethics Review: true
Submission Number: 712