DocUnfold: Leveraging Unfolding Network and a Real-World Large-Scale Dataset for Handwriting Contamination Removal in Documents

Xuhang Chen, Ziyang Zhou, Zimeng Li, Xiujun Zhang, Yihang Dong, Kim-Fung Tsang

Published: 2026, Last Modified: 20 Apr 2026. IEEE Trans. Consumer Electron. 2026. License: CC BY-SA 4.0
Abstract: Handwritten annotations and ink contaminants compromise the visual quality and readability of documents. While deep learning has demonstrated impressive results in related tasks such as text enhancement, it has yet to be applied to the prevalent problem of handwriting removal, primarily because foundational datasets and robust methods are lacking. To address this limitation, we construct HW5K, a comprehensive dataset comprising over 5,000 pairs of high-resolution ($2047\times 2537$) document images affected by handwriting. The dataset covers a diverse range of handwriting styles, backgrounds, and document formats, providing a robust foundation for advancing research in document image restoration. Alongside HW5K, we propose DocUnfold, a novel model specifically designed for high-quality restoration of contaminated document images. DocUnfold employs a multi-stage unfolding process to systematically extract and reconstruct multi-level image features. The model features the DocShuffle module, which disrupts the spatial dependencies of the degradation through random shuffling operations. After shuffling, a multi-window partitioning mechanism and a corresponding windowed attention module extract feature information, enabling DocUnfold to remove handwritten content effectively while preserving the integrity of the original document. Extensive experiments demonstrate the efficacy of both our dataset and model, with impressive results in qualitative and quantitative evaluations. Together, HW5K and DocUnfold establish a new benchmark for document handwriting removal. Code and data are available at https://github.com/CXH-Research/DocUnfold
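The abstract names three ingredients of DocShuffle: random shuffling, partitioning into windows of several sizes, and attention within each window. The NumPy sketch below illustrates these generic operations only. It is not the authors' implementation: the function names (`shuffle_pixels`, `unshuffle_pixels`, `window_partition`, `window_self_attention`), the pixel-level granularity of the shuffle, the window sizes, and the projection-free attention are all assumptions made for illustration. The linked repository is the reference for the actual module.

```python
import numpy as np


def shuffle_pixels(feat, rng):
    """Randomly permute the spatial positions of an (H, W, C) feature map.

    Assumption: shuffling acts on individual spatial positions; the
    abstract does not state the granularity used by DocShuffle.
    Returns the shuffled map and the permutation needed to undo it.
    """
    h, w, c = feat.shape
    perm = rng.permutation(h * w)
    flat = feat.reshape(h * w, c)
    return flat[perm].reshape(h, w, c), perm


def unshuffle_pixels(shuffled, perm):
    """Invert shuffle_pixels so features return to their original positions."""
    h, w, c = shuffled.shape
    flat = shuffled.reshape(h * w, c)
    restored = np.empty_like(flat)
    restored[perm] = flat
    return restored.reshape(h, w, c)


def window_partition(feat, win):
    """Split an (H, W, C) map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C).
    """
    h, w, c = feat.shape
    assert h % win == 0 and w % win == 0, "H and W must be divisible by win"
    x = feat.reshape(h // win, win, w // win, win, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, c)


def window_self_attention(windows):
    """Softmax self-attention inside each window, without learned projections.

    A trained model would use learned query/key/value projections; they
    are omitted here to keep the sketch short.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))           # toy feature map
shuffled, perm = shuffle_pixels(feat, rng)

# Hypothetical window sizes; the abstract only says "multi-window".
outputs = {win: window_self_attention(window_partition(shuffled, win))
           for win in (2, 4)}
restored = unshuffle_pixels(shuffled, perm)
```

In this toy setup the shuffle is exactly invertible, so spatial layout can be restored after the windowed attention step; whether and how DocUnfold restores the original ordering is not stated in the abstract.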