Self-paced Learning to Improve Text Row Detection in Historical Documents with Missing Labels

Mihaela Gaman, Lida Ghadamiyan, Radu Tudor Ionescu, Marius Popescu

Published: 01 Jan 2022, Last Modified: 01 Jun 2023
ECCV Workshops (4) 2022
Readers: Everyone

Abstract: An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations in the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
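
The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `make_batches`, `merge_with_nms`, and `self_paced_training`, the `train` and `predict` callbacks, the IoU threshold of 0.5, and the choice to give ground-truth boxes a score of 1.0 so they always survive suppression are all assumptions. The sketch also assumes that pseudo-boxes are predicted only for the newly added batch, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_with_nms(gt: List[Box], pseudo: List[Tuple[Box, float]],
                   iou_thr: float = 0.5) -> List[Box]:
    """Combine ground-truth and predicted boxes with greedy NMS.
    Assumption: ground-truth boxes get score 1.0 and are processed first."""
    scored = [(b, 1.0) for b in gt] + list(pseudo)
    scored.sort(key=lambda t: t[1], reverse=True)  # stable: GT stays ahead on ties
    kept: List[Box] = []
    for box, _ in scored:
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return kept

def make_batches(pages: Dict[str, List[Box]], k: int) -> List[List[str]]:
    """Sort pages by number of ground-truth boxes (descending), split into k batches."""
    order = sorted(pages, key=lambda p: len(pages[p]), reverse=True)
    n = len(order)
    return [order[i * n // k:(i + 1) * n // k] for i in range(k)]

def self_paced_training(pages: Dict[str, List[Box]], k: int,
                        train: Callable, predict: Callable,
                        iou_thr: float = 0.5):
    """Train over k iterations, adding one batch per iteration.
    `train(annotations)` returns a model; `predict(model, page_id)` returns
    a list of (box, score) pairs. Both are hypothetical callbacks."""
    annotations: Dict[str, List[Box]] = {}
    train_ids: List[str] = []
    model = None
    for batch in make_batches(pages, k):
        for pid in batch:
            if model is None:
                # First batch: pages with the most labels, used as they are.
                annotations[pid] = list(pages[pid])
            else:
                # Later batches: complete sparse labels with pseudo-boxes.
                annotations[pid] = merge_with_nms(pages[pid], predict(model, pid), iou_thr)
        train_ids.extend(batch)
        model = train({pid: annotations[pid] for pid in train_ids})
    return model, annotations
```

In practice `train` would fit a detector such as YOLOv4 on the current annotations and `predict` would run it on a page image; here they are left abstract so that the batching and merging logic stays visible.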
