Abstract: Rapid growth in the digitization of documents, such as paper-based invoices and receipts, has increased the demand for methods that process information accurately and efficiently. Extracting the data manually has become impractical, as it is labor-intensive and time-consuming. Digital documents contain various components, such as tables, key-value pairs, and figures. Existing optical character recognition (OCR) methods can recognize text, but extracting key-value pairs from unformatted digital invoices or receipts remains challenging. An information extraction system built on intelligent algorithms would therefore be beneficial, as it can improve workflow efficiency for knowledge discovery and data recognition. This paper proposes an information extraction pipeline that uses intelligent computing and deep learning approaches to first classify key-value pairs and then link them. Two key-value pairing rules are developed for the proposed pipeline. Various experiments with intelligent algorithms are conducted to evaluate the performance of the pipeline.