DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images

Published: 01 Nov 2019, Last Modified: 05 May 2023 | DI 2019 | Readers: Everyone
Keywords: document analysis, semantic segmentation, computer vision
TL;DR: Neural-based removal of document ink artifacts (underlines, smudges, etc.) using no manually annotated training data
Abstract: Even in 2019, many documents still come into businesses as scans rather than in digital format. Text to be extracted from real-world documents is often nestled inside rich formatting, such as tabular structures or forms with fill-in-the-blank boxes or underlines, whose ink often touches or even strikes through the ink of the text itself. Such ink artifacts can severely interfere with the performance of recognition algorithms and other downstream processing tasks. In this work, we propose DeepErase, a neural preprocessor to erase ink artifacts from text images. We devise a method to programmatically augment text images with real artifacts, and use them to train a segmentation network in a weakly supervised manner. In addition to high segmentation accuracy, we show that our cleansed images achieve a significant boost in downstream recognition accuracy by popular OCR software such as Tesseract 4.0. We test DeepErase on out-of-distribution datasets (NIST SDB) of scanned IRS tax return forms and achieve double-digit improvements in recognition accuracy over baseline for both printed and handwritten text.
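The abstract does not spell out how the augmentation produces training labels. One plausible reading is that overlaying a known artifact on a clean text crop yields the segmentation mask for free, which is what removes the need for manual annotation. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `overlay_artifact`, the `INK_THRESHOLD` value, the darkest-pixel compositing rule, and the toy images are all hypothetical, and plain Python lists stand in for real image arrays.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black ink, 255 = white paper

INK_THRESHOLD = 128  # hypothetical cutoff: pixels darker than this count as ink


def overlay_artifact(text_img: Image, artifact_img: Image) -> Tuple[Image, Image]:
    """Composite an artifact onto a clean text crop and derive a pixel mask.

    Returns (augmented, mask), where mask is 1 wherever the artifact
    layer contains ink and 0 elsewhere.
    """
    if len(text_img) != len(artifact_img) or any(
        len(t) != len(a) for t, a in zip(text_img, artifact_img)
    ):
        raise ValueError("text and artifact images must have the same shape")

    augmented: Image = []
    mask: Image = []
    for text_row, art_row in zip(text_img, artifact_img):
        # The darker pixel wins, so artifact ink can touch or cross text ink.
        augmented.append([min(t, a) for t, a in zip(text_row, art_row)])
        # The label is read directly off the artifact layer (assumed approach).
        mask.append([1 if a < INK_THRESHOLD else 0 for a in art_row])
    return augmented, mask


# A 3x4 toy crop: one vertical text stroke, plus an underline on the last row.
text = [
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 255, 255, 255],
]
underline = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
]
augmented, mask = overlay_artifact(text, underline)
```

In a setup like this, pairs of `augmented` and `mask` would serve as inputs and targets for a segmentation network, and at inference time the predicted artifact pixels would be painted back to the paper color before the crop is passed to OCR.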