Abstract: The increasing reliance on OCR technologies to digitize documents has enabled large-scale automation but also introduced new challenges for information extraction systems. While state-of-the-art OCR engines perform well under ideal conditions, they remain prone to errors. Traditional OCR evaluation metrics such as character error rate (CER) and word error rate (WER) fail to capture the impact of such errors on downstream tasks, particularly when only semantically critical words are affected. In this paper, we systematically investigate the relationship between OCR quality and extraction accuracy in business documents, with a focus on key field extraction and line item recognition. We introduce a controlled evaluation framework that simulates realistic OCR noise scenarios by selectively injecting errors into clean datasets. Our experiments show that standard OCR metrics poorly reflect the impact of noise on information extraction performance and highlight the need for task-specific OCR evaluation protocols and more resilient pipelines tailored to real-world settings.
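The selective error-injection idea can be pictured with a minimal Python sketch under assumptions not stated in the abstract: the confusion table, the error-rate parameter, and helper names such as `inject_ocr_noise` and `corrupt_field` are illustrative stand-ins, not the paper's actual framework.

```python
import random

# Visually confusable character pairs often seen in OCR output
# (illustrative only; the paper's real confusion model is not given here).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}


def inject_ocr_noise(text: str, error_rate: float, seed: int = 0) -> str:
    """Corrupt a clean string with character-level OCR-like errors.

    Each character is hit independently with probability `error_rate`;
    a hit becomes a confusable substitution when one exists, otherwise
    a deletion or a spurious insertion.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= error_rate:
            out.append(ch)                      # character survives intact
        elif ch in CONFUSIONS:
            out.append(CONFUSIONS[ch])          # substitution
        elif rng.random() < 0.5:
            pass                                # deletion
        else:
            out.append(ch + rng.choice("abc123"))  # insertion after the char
    return "".join(out)


def corrupt_field(record: dict, field: str, error_rate: float) -> dict:
    """Selectively corrupt one semantically critical field (e.g. a total
    amount) while leaving the rest of the document text untouched."""
    noisy = dict(record)
    noisy[field] = inject_ocr_noise(record[field], error_rate)
    return noisy


if __name__ == "__main__":
    invoice = {"invoice_number": "INV-001234", "total_amount": "1,580.00"}
    print(corrupt_field(invoice, "total_amount", error_rate=0.3))
```

Corrupting a single critical field this way barely moves the document-level character error rate, yet extraction of that field can fail outright, which is the mismatch between corpus-level OCR metrics and downstream performance that the abstract highlights.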