How Does Changing the Optical Character Recognition System Impact the Layout-Aware Named Entity Recognition Models?
Abstract: Merging information from physical and digital documents is essential in an era when information is becoming ever more relevant. Different strategies have been used to combine knowledge from these two data sources. One state-of-the-art approach to this data extraction problem is Named Entity Recognition (NER). However, even for these advanced models, performance still depends heavily on the Optical Character Recognition (OCR) system used to read the text from physical documents. This paper investigates this dependence and how changing the OCR system between the training and inference phases influences NER performance. We verify that changing the OCR system negatively impacts the performance of data extraction models. Furthermore, we show that models trained on less accurate OCR output are more robust to OCR changes in the inference phase. When the OCR system is the same in the training and inference stages, the most accurate OCR system should be preferred. We also propose mitigating this problem by mixing OCR systems during the training phase. This approach enhances the model's robustness while preserving a high F1-score.