Abstract: We present a new hybrid document layout analysis approach that simultaneously detects graphical page objects, groups text-lines into text regions according to reading order, and recognizes the logical roles of text regions in heterogeneous document images. For graphical page object detection, we use DINO, a state-of-the-art Transformer-based object detection model, to detect tables, figures, and (displayed) formulas in a top-down manner. For text, we introduce a new bottom-up text region detection model that groups text-lines located outside graphical page objects into text regions according to reading order and recognizes the logical role of each text region using both visual and textual features. Experimental results show that our bottom-up text region detection model achieves higher localization and logical role classification accuracy than previous top-down methods. Moreover, in addition to the locations of text regions, our approach directly outputs the reading order of the text-lines within each text region. State-of-the-art results on DocLayNet and PubLayNet demonstrate the effectiveness of our approach.
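The bottom-up grouping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes that some upstream predictor has already supplied, for each text-line, the index of the next line in the same region (the hypothetical `successor` field), and that graphical page objects arrive as plain bounding boxes. The names `TextLine`, `inside_fraction`, `group_text_lines`, and the 0.5 overlap threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class TextLine:
    box: Box
    text: str
    # Hypothetical model output: index of the next line in the same
    # text region, or None if this line ends its region.
    successor: Optional[int] = None


def inside_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def group_text_lines(
    lines: List[TextLine],
    graphical_boxes: List[Box],
    overlap_threshold: float = 0.5,  # assumed value, not from the paper
) -> List[List[int]]:
    """Group line indices into regions, each listed in reading order.

    Assumes the successor links form disjoint chains (no cycles, no
    two lines sharing a successor).
    """
    # Keep only text-lines that lie outside every graphical page object.
    keep = {
        i
        for i, line in enumerate(lines)
        if all(inside_fraction(line.box, g) < overlap_threshold
               for g in graphical_boxes)
    }
    # Retain successor links whose target is also a kept line.
    succ = {i: lines[i].successor for i in keep if lines[i].successor in keep}
    has_predecessor = set(succ.values())

    regions = []
    # A line with no predecessor starts a region; follow links from it.
    for start in sorted(keep - has_predecessor):
        chain, cur = [], start
        while cur is not None:
            chain.append(cur)
            cur = succ.get(cur)
        regions.append(chain)
    return regions
```

Each returned list is one text region, with its line indices already in reading order; a logical role classifier would then operate on these groups.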