Generating Errors: OCR Post-Processing for IcelandicDownload PDF

Published: 20 Mar 2023, Last Modified: 17 Apr 2023NoDaLiDa 2023Readers: Everyone
Keywords: ocr, icelandic, transformers, artificial errors
TL;DR: Generating Errors: OCR post-processing for Icelandic
Abstract: We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google's ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
4 Replies

Loading