Abstract: An important proportion of documents are document images, i.e. scanned documents. For their retrieval, it is important to recognize their contents. Current technologies for optical character recognition (OCR) and document analysis do not handle such documents adequately because of the recognition errors. In this paper, we describe an approach that integrates the detection of errors in scanned texts without relying on a lexicon, and this detection is integrated in the research process. The proposed algorithm consists of two basic steps. In the first step, we apply editing operations on OCR words that generate a collection of error-grams and correction rules. The second step uses query terms, error-grams, and correction rules to create searchable keywords, identify appropriate matching terms, and determine the degree of relevance of retrieved document images. Algorithms has been tested on 979 document images provided by Media-team databases from Washington University, and the experimental results obtained show the effectiveness of our method and indicate improvement in comparison with the standard methods such as exact or partial matching, N-gram overlaps, and Q-gram distance.
0 Replies
Loading