DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images

Yike Qi; W. Ronny Huang; Qianqian Li; Jonathan L. Degange

Published: 01 Nov 2019, Last Modified: 06 Jul 2025, DI 2019, Readers: Everyone

Keywords: document analysis, semantic segmentation, computer vision

TL;DR: Neural-based removal of document ink artifacts (underlines, smudges, etc.) using no manually annotated training data

Abstract: Still in 2019, many scanned documents come into businesses in non-digital format. Text to be extracted from real-world documents is often nestled inside rich formatting, such as tabular structures or forms with fill-in-the-blank boxes or underlines whose ink often touches or even strikes through the ink of the text itself. Such ink artifacts can severely interfere with the performance of recognition algorithms or other downstream processing tasks. In this work, we propose DeepErase, a neural preprocessor to erase ink artifacts from text images. We devise a method to programmatically augment text images with real artifacts, and use them to train a segmentation network in a weakly supervised manner. In addition to high segmentation accuracy, we show that our cleansed images achieve a significant boost in downstream recognition accuracy by popular OCR software such as Tesseract 4.0. We test DeepErase on out-of-distribution datasets (NIST SDB) of scanned IRS tax return forms and achieve double-digit improvements in recognition accuracy over baseline for both printed and handwritten text.
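The augmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes grayscale, dark-on-white images of equal size, uses a synthetic underline where the paper uses real artifacts, and the names `superimpose_artifact`, `erase`, and `ink_threshold` are hypothetical. The point is only that compositing an artifact onto a clean text image yields a segmentation mask for free, with no manual annotation.

```python
import numpy as np

def superimpose_artifact(text_img, artifact_img, ink_threshold=128):
    """Overlay a dark-on-white artifact onto a dark-on-white text image.

    Returns the composite image and a binary mask of artifact ink.
    The mask comes directly from the artifact image, so no manual
    labeling is needed (hypothetical helper, for illustration only).
    """
    if text_img.shape != artifact_img.shape:
        raise ValueError("images must have the same shape")
    # The darker pixel wins, mimicking ink drawn over ink.
    composite = np.minimum(text_img, artifact_img)
    mask = (artifact_img < ink_threshold).astype(np.uint8)
    return composite, mask

def erase(img, mask, background=255):
    """Paint masked pixels with the background color."""
    cleaned = img.copy()
    cleaned[mask.astype(bool)] = background
    return cleaned

# Toy example: a blob of "text" ink plus an underline artifact.
text = np.full((8, 16), 255, dtype=np.uint8)
text[2:5, 3:6] = 0            # stand-in for a text stroke
underline = np.full((8, 16), 255, dtype=np.uint8)
underline[6, :] = 0           # stand-in for an underline artifact

composite, mask = superimpose_artifact(text, underline)
cleaned = erase(composite, mask)
```

In a training setup along these lines, `(composite, mask)` pairs would serve as inputs and targets for a segmentation network, and `erase` would be applied to the predicted mask at inference time. Where artifact ink overlaps text ink, this naive erase would also remove text pixels, which a real system would need to handle.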

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/deeperase-weakly-supervised-ink-artifact/code)
