Abstract: Document image understanding is challenging because a document image combines illustrations and text in complex ways. Previous document image classification datasets and models focus more on document format while ignoring the meaningful content. In this paper, we introduce DocCT, the first content-aware document image classification dataset, which covers various daily topics and requires understanding fine-grained document content to classify correctly. Furthermore, previous pure vision models cannot sufficiently understand the semantic content of document images, so OCR is commonly adopted as an auxiliary component to facilitate content understanding. To investigate whether document image content can be understood without the help of OCR, we present DocMAE, a new self-supervised pretrained document image model that uses no extra OCR information. Experiments show that DocMAE’s ability to understand fine-grained content is far greater than that of previous vision models and even surpasses some OCR-based models, which demonstrates that the semantics of document images can be understood well from pixels alone. The dataset can be downloaded at https://github.com/zhenwangrs/DocCT.