Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Comedies and Farces

ACL ARR 2024 June Submission1506 Authors

14 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In this paper, we investigate the performance of two types of transformers on OCR post-correction of early modern Dutch: large generative models and sequence-to-sequence models. To this end, we create a parallel corpus by automatically aligning OCR sentences with their ground truth from the EmDComF corpus of early modern Dutch comedies and farces, and propose an alignment methodology that creates new segments from combinations of newline splits. This improves the alignment between gold and OCR texts, which is essential for building a high-quality parallel corpus. After filtering out misalignments, we fine-tune and evaluate both generative and sequence-to-sequence models. We find that mBART outperforms generative models for the automatic post-correction of early modern Dutch in the EmDComF corpus, correcting more OCR sequences and avoiding overgeneration.
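The alignment idea described in the abstract — building candidate segments from combinations of newline splits and matching each gold sentence to its best OCR candidate — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `max_span` parameter, the use of `difflib.SequenceMatcher` as the similarity measure, and the toy Dutch example are all assumptions introduced here.

```python
from difflib import SequenceMatcher

def candidate_segments(ocr_lines, max_span=3):
    """Join up to max_span consecutive newline-split OCR lines into
    candidate segments (max_span is a hypothetical parameter)."""
    cands = []
    for i in range(len(ocr_lines)):
        for j in range(i + 1, min(i + max_span, len(ocr_lines)) + 1):
            cands.append((i, j, " ".join(ocr_lines[i:j])))
    return cands

def best_alignment(gold_sentence, ocr_lines, max_span=3):
    """Return the candidate segment most similar to the gold sentence,
    scored here with a simple character-level similarity ratio."""
    return max(
        candidate_segments(ocr_lines, max_span),
        key=lambda c: SequenceMatcher(None, gold_sentence, c[2]).ratio(),
    )

# Toy example: the gold sentence spans two OCR lines with a spelling variant.
gold = "De oude man lacht hartelijk."
ocr = ["De oude man", "lacht hartelyk.", "Een ander vers"]
i, j, segment = best_alignment(gold, ocr)
```

In practice, a similarity threshold would then be applied to filter out residual misalignments before fine-tuning, as the abstract describes.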
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: historical NLP
Contribution Types: NLP engineering experiment, Reproduction study, Data resources, Data analysis
Languages Studied: Historical Dutch, early modern Dutch, Dutch
Submission Number: 1506