GDP: Generic Document Pretraining to Improve Document Understanding

Published: 01 Jan 2024 · Last Modified: 17 Jul 2025 · ICDAR (1) 2024 · CC BY-SA 4.0
Abstract: In this paper, we propose a novel pretraining approach for document analysis that advances beyond conventional methods. The approach, called GDPerformer, trains a suite of architectures to predict both masked OCR tokens and masked OCR bounding boxes, encouraging the network to learn document semantics such as structure and language. Our experiments with GDPerformerv1 and GDPerformerv2 show improved performance on downstream tasks, including Semantic Entity Recognition and Extraction as well as Multi-Modal Document Classification, with minimal task-specific data and generalization to a wide range of documents. Furthermore, the pretrained features are robust to noisy documents and extend easily to multiple languages. Our experiments indicate that the proposed pretraining strategy requires only 50K document images, making it particularly beneficial for low-resource languages.
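To make the pretraining objective concrete, below is a minimal sketch of joint masked-OCR-token and masked-bounding-box prediction, the objective the abstract describes. This is not the authors' implementation: the module names (GDPEncoder), dimensions, masking rate, and loss weighting are all illustrative assumptions, and the actual GDPerformer architectures are not specified in this excerpt.

```python
# A hedged sketch (not the authors' code) of a joint masked-OCR pretraining
# objective: the model sees OCR tokens plus their bounding boxes, some of
# which are masked, and is trained to recover both. All names are assumed.
import torch
import torch.nn as nn


class GDPEncoder(nn.Module):
    """Toy transformer encoder over OCR token + box embeddings (assumed design)."""

    def __init__(self, vocab_size=30522, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Boxes as normalized (x0, y0, x1, y1) in [0, 1], projected to d_model.
        self.box_proj = nn.Linear(4, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.tok_head = nn.Linear(d_model, vocab_size)  # masked-token prediction
        self.box_head = nn.Linear(d_model, 4)           # masked-box regression

    def forward(self, tokens, boxes):
        h = self.encoder(self.tok_emb(tokens) + self.box_proj(boxes))
        return self.tok_head(h), self.box_head(h)


def pretraining_loss(model, tokens, boxes, mask_id=103, mask_prob=0.15):
    """Mask a random subset of positions; score recovery of tokens and boxes."""
    mask = torch.rand(tokens.shape) < mask_prob
    in_tokens = tokens.masked_fill(mask, mask_id)
    in_boxes = boxes.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked boxes
    tok_logits, box_pred = model(in_tokens, in_boxes)
    tok_loss = nn.functional.cross_entropy(tok_logits[mask], tokens[mask])
    box_loss = nn.functional.l1_loss(box_pred[mask], boxes[mask])
    return tok_loss + box_loss  # equal weighting is an assumption


# Usage with dummy data: token ids and normalized boxes for 2 pages of 32 words.
model = GDPEncoder()
tokens = torch.randint(0, 30522, (2, 32))
boxes = torch.rand(2, 32, 4)
loss = pretraining_loss(model, tokens, boxes)
loss.backward()
```

The key design point this sketch tries to capture is that the box head gives the model a spatial reconstruction signal alongside the usual masked-language signal, which is one plausible way the objective could teach both document structure and language.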