Keywords: historic newspaper processing, digital history, computer vision
Abstract: The correct detection of article layout in historical newspaper pages remains challenging
but is important for Natural Language Processing (NLP) and machine
learning applications in the field of digital history. Digital newspaper portals
typically provide Optical Character Recognition (OCR) text, albeit of varying quality.
Unfortunately, layout information is often missing, which limits the usefulness of
this rich source. Our dataset is designed to address this issue for historic German-language
newspapers. The Chronicling Germany dataset contains 581 annotated historical
newspaper pages from the period between 1852 and 1924. Domain experts in history
have spent more than 1,500 hours annotating the dataset. The paper presents
a processing pipeline and establishes baseline results on in- and out-of-domain test
data using this pipeline. Both our dataset and the corresponding baseline code are
freely available online. This work creates a starting point for future research in
the field of digital history and historic German-language newspaper processing.
Furthermore, it provides the opportunity to study a low-resource task in computer
vision.
Flagged For Ethics Review: true
Submission Number: 712