Document Registration: Towards Automated Labeling of Pixel-Level Alignment Between Warped-Flat Documents

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Photographed documents are prevalent but often suffer from deformations such as curves or folds, which hinder readability. Document dewarping has therefore been widely studied; however, its performance remains unsatisfactory due to the lack of real training samples with pixel-level annotation. To obtain pixel-level labels, we leverage a document registration pipeline that automatically aligns warped-flat document pairs. Unlike general image registration, registering documents poses unique challenges due to severe deformations and fine-grained textures. In this paper, we introduce a coarse-to-fine framework consisting of a coarse registration network (CRN) that eliminates severe deformations, followed by a fine registration network (FRN) that focuses on fine-grained features. In addition, we use self-supervised learning to initialize our document registration model, proposing a cross-reconstruction pre-training task on warped-flat document pairs. Extensive experiments show that our method achieves satisfactory document registration performance, yielding a high-quality registered document dataset with pixel-level annotation. Without bells and whistles, we re-train two popular document dewarping models on our registered document dataset WarpDoc-R and obtain performance superior to models trained on almost 100× as much synthetic training data, verifying the label quality of our document registration method. The code and pixel-level labels will be released.
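To make the coarse-to-fine idea concrete, below is a minimal PyTorch sketch of how such a two-stage registration forward pass could be composed: a coarse network predicts a flow that removes large deformations, the warped image is resampled, and a fine network predicts a residual flow on the coarsely aligned pair. The module names (CoarseToFineRegistration, the tiny stand-in CNN) and the exact flow composition are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: module names and the flow composition are
# assumptions, not the paper's actual CRN/FRN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_cnn(out_ch):
    # Tiny stand-in backbone; a real registration network would be far deeper.
    return nn.Sequential(
        nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_ch, 3, padding=1),
    )

class CoarseToFineRegistration(nn.Module):
    """Coarse stage removes large deformations; fine stage refines a residual flow."""
    def __init__(self):
        super().__init__()
        self.crn = small_cnn(2)  # predicts a coarse 2-channel flow field
        self.frn = small_cnn(2)  # predicts a residual flow on the coarsely aligned pair

    @staticmethod
    def warp(img, flow):
        # Build a normalized sampling grid in [-1, 1] and offset it by the flow.
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=img.device),
            torch.linspace(-1, 1, w, device=img.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
        grid = base + flow.permute(0, 2, 3, 1)
        return F.grid_sample(img, grid, align_corners=True)

    def forward(self, warped, flat):
        coarse_flow = self.crn(torch.cat([warped, flat], dim=1))
        coarse_aligned = self.warp(warped, coarse_flow)
        residual_flow = self.frn(torch.cat([coarse_aligned, flat], dim=1))
        return self.warp(coarse_aligned, residual_flow), coarse_flow, residual_flow

# Usage: align a warped photo to its flat counterpart (both 3-channel tensors).
model = CoarseToFineRegistration()
warped, flat = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
aligned, coarse_flow, residual_flow = model(warped, flat)
```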
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Photographed documents are not only a common medium for sharing information but also a natural multi-modal application scenario that involves both images and text. Because photos are captured casually, problems such as curved or folded paper deformations greatly reduce readability for both humans and optical character recognition (OCR). Even with the current wave of large multi-modal models (LMMs), the most advanced LMMs achieve only limited performance on photographed document recognition and understanding. The proposed method greatly alleviates the labeling difficulty for photographed documents through a registration pipeline: we contribute an automated labeling method that produces pixel-level annotations for photographed documents. In this way, we provide a data basis for OCR and multi-modal large language models to understand photographed documents. We hope the proposed method will help build a data-centric study paradigm for document intelligence.
Supplementary Material: zip
Submission Number: 4585