Abstract: This paper presents a method for document layout analysis based on whitespace analysis in maximum homogeneous regions, designed to balance processing time against performance. The method consists of two main stages: classification and segmentation. First, text and non-text elements are classified by analyzing whitespace in maximum multi-layer horizontal homogeneous regions. Text regions are then extracted using mathematical morphology, and non-text elements are further classified into separators, tables, and images with a machine learning approach. The effectiveness of the proposed method is demonstrated by experiments on the UW-III (A1) dataset.
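As a rough illustration of what whitespace analysis on a horizontal region can look like, the sketch below finds runs of empty rows in a small binary image (1 = ink, 0 = background) and uses them to split the region into bands. It is not the algorithm of this paper: the projection-profile approach, the function names `horizontal_profile`, `whitespace_runs`, and `split_into_bands`, and the `min_gap` parameter are all assumptions made for the example.

```python
from typing import List, Tuple

Region = List[List[int]]  # binary image rows: 1 = ink, 0 = background


def horizontal_profile(region: Region) -> List[int]:
    """Count ink pixels in each row of the region."""
    return [sum(row) for row in region]


def whitespace_runs(profile: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of at least min_len consecutive empty rows."""
    runs = []
    start = None
    for i, count in enumerate(profile):
        if count == 0:
            if start is None:
                start = i  # a new run of empty rows begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that reaches the bottom of the region.
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs


def split_into_bands(region: Region, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Split the region into row bands separated by whitespace runs of at least min_gap rows.

    Gaps shorter than min_gap (hypothetical threshold) stay inside a band.
    """
    profile = horizontal_profile(region)
    bands = []
    cursor = 0
    for gap_start, gap_end in whitespace_runs(profile, min_gap):
        if gap_start > cursor:
            bands.append((cursor, gap_start))
        cursor = gap_end
    if cursor < len(profile):
        bands.append((cursor, len(profile)))
    return bands


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(split_into_bands(page, min_gap=1))
```

Raising `min_gap` merges bands separated by narrow gaps, which is one simple way to trade granularity for robustness; a real system would operate on connected components or regions rather than raw rows.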