A Deep Learning Approach for Text Segmentation in Document Analysis

Published: 01 Jan 2020, Last Modified: 14 May 2023 | ACOMP 2020 | Readers: Everyone
Abstract: Text segmentation plays an essential role in both page segmentation and document reading comprehension. In this manuscript, we present a system that separates a page into homogeneous regions from which information can be extracted. Our approach uses a U-Net-based network to extract text lines, which are then read by an OCR system built on a Convolutional Recurrent Neural Network (CRNN). We group the text lines and their OCR results simultaneously, following an idea borrowed from the DBSCAN algorithm. The system also includes supporting modules, such as template matching and deskewing, to improve performance. To implement and evaluate these ideas, we built a complete Vietnamese dataset for training and testing. The system achieves over 90% accuracy on both Vietnamese and English text.
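
The abstract does not spell out how the DBSCAN-inspired grouping works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each text line is an axis-aligned bounding box `(x0, y0, x1, y1)` in pixels and grows a region by repeatedly absorbing any box whose gap to a box already in the region is at most `eps`. This is DBSCAN-style neighbourhood expansion with a minimum cluster size of 1, so no box is treated as noise. The names `box_gap`, `group_text_lines`, and the default `eps=15` are hypothetical.

```python
from collections import deque


def box_gap(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)  # horizontal separation
    dy = max(a[1] - b[3], b[1] - a[3], 0)  # vertical separation
    return max(dx, dy)


def group_text_lines(boxes, eps=15):
    """Assign a region label to every box by DBSCAN-style expansion.

    Assumption: min_samples is 1, so every box belongs to some region.
    """
    labels = [-1] * len(boxes)  # -1 means "not yet visited"
    region = 0
    for seed in range(len(boxes)):
        if labels[seed] != -1:
            continue
        labels[seed] = region
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in range(len(boxes)):
                # Absorb every unvisited box within eps of the current box.
                if labels[j] == -1 and box_gap(boxes[i], boxes[j]) <= eps:
                    labels[j] = region
                    queue.append(j)
        region += 1
    return labels


# Two stacked lines form one paragraph; the other two boxes stand alone.
boxes = [(10, 10, 200, 30), (10, 35, 180, 55),
         (10, 200, 190, 220), (300, 10, 450, 30)]
print(group_text_lines(boxes))  # [0, 0, 1, 2]
```

In a full pipeline the OCR string for each box would carry the same label, so text and layout are grouped in one pass. A real system would likely replace the quadratic neighbour scan with a spatial index and tune `eps` relative to line height.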