YOLO Assisted A* Algorithm for Robust Line Segmentation of Degraded Document Images.

Published: 08 Oct 2024, Last Modified: 07 Mar 2025
Venue: Document Analysis and Recognition - ICDAR 2024 - 18th International Conference
License: CC BY 4.0
Abstract: Although OCR on images of good-quality documents can be considered a solved problem, the same is not true when document quality has degraded, for example because of age. At the same time, OCR of old documents is important for preservation of cultural heritage, indexing, retrieval, etc. Degraded-document OCR is often difficult for a number of reasons, including the close resemblance between noisy background and faded foreground pixels and the asymmetric skews of different lines. The study presented in this article was conducted on a dataset of recently collected images of old, severely degraded document pages, in addition to a few other datasets; the task is very difficult because of the high degradation level of the samples and the lack of ground truth for training. We propose a hybrid approach that combines a learning-based method with a rule-based method for line segmentation of such degraded documents. The proposed method uses the well-known object detection system YOLO, trained on a publicly available dataset of handwritten samples, to predict the starting point (the leftmost point) of each line divider; the remainder of the segmenting line is obtained using a modified version of the graph-traversal approach 'A* path finding'. This yields a segmenting path that separates two consecutive text lines, starting from the predicted left end point and terminating at the right end point. The proposed approach overcomes various existing challenges in line segmentation of old, degraded documents and improves results on several publicly available datasets. Performance comparisons covering three existing strategies and five datasets of different languages and varying degradation levels, with both printed and handwritten text, are presented in this article.
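To make the second stage concrete, the following is a minimal sketch of tracing a separator with plain A* on a binarized page. It is not the authors' modified A*, whose details the abstract does not give. The function name `trace_separator`, the `ink_cost` penalty, and the move set are illustrative assumptions, and the starting row is taken as given, standing in for the left end point that the YOLO detector would predict.

```python
import heapq

def trace_separator(ink, start_row, ink_cost=50.0):
    """Plain A* from (start_row, 0) to any cell in the last column.

    ink: list of rows, each a list of 0 (background) or 1 (ink).
    start_row: assumed to come from a detector; here it is just an int.
    ink_cost: hypothetical penalty for stepping onto an ink pixel.
    Returns the path as a list of (row, col) cells, or [] if none exists.
    """
    rows, cols = len(ink), len(ink[0])
    start = (start_row, 0)

    # Heuristic: columns still to cross. Every column advance costs at
    # least 1.0, so this never overestimates the remaining cost.
    def h(cell):
        return cols - 1 - cell[1]

    # (row delta, col delta, step cost): right, two diagonals, up, down.
    moves = [(0, 1, 1.0), (-1, 1, 1.4), (1, 1, 1.4), (-1, 0, 1.0), (1, 0, 1.0)]

    best = {start: 0.0}          # cheapest known cost to reach each cell
    parent = {}                  # back-pointers for path reconstruction
    frontier = [(h(start), 0.0, start)]

    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best.get(cell, float("inf")):
            continue             # stale queue entry
        if cell[1] == cols - 1:  # reached the right edge
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        r, c = cell
        for dr, dc, step in moves:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                ng = g + step + ink_cost * ink[nr][nc]
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return []
```

Treating ink as expensive rather than forbidden lets the path cut through a touching stroke when no clean gap exists, which is one common way to handle overlapping ascenders and descenders; whether the paper's modification works this way is not stated in the abstract.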