Abstract: Document image understanding is challenging because a document image combines illustrations and text in complex ways. Previous document image classification datasets and models focus on document format while largely ignoring the semantic content. In this paper, we introduce DocCT, a first-of-its-kind document image classification dataset covering a variety of everyday topics, where correct classification requires understanding fine-grained document content. Further, since previous image models cannot sufficiently capture the semantic content of document images, we present DocMAE, a new self-supervised pre-trained document image model. Experiments show that DocMAE understands fine-grained content far better than previous models and even surpasses OCR-based models, demonstrating that the semantics of document images can be understood well from pixels alone.
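The name DocMAE suggests a masked-autoencoder style of self-supervised pre-training on document image patches; the abstract does not give architectural details, so the following is only a minimal illustrative sketch of that general idea, not the paper's actual model. The class name `TinyMAE` and all hyperparameters (patch size, mask ratio, layer counts) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal masked-autoencoder sketch for document image patches (illustrative only)."""
    def __init__(self, patch_dim=768, embed_dim=256, num_patches=196, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.head = nn.Linear(embed_dim, patch_dim)  # reconstruct raw pixels per patch

    def forward(self, patches):
        # patches: (B, N, patch_dim) -- flattened pixel patches of a document page
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed

        # Randomly keep a subset of patches; the rest are masked out.
        num_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # Encode only the visible patches.
        latent = self.encoder(x_visible)

        # Pad with mask tokens, restore the original patch order, and decode.
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        pred = self.head(self.decoder(full + self.pos_embed))

        # Reconstruction loss is computed only on the masked patches.
        mask = torch.ones(B, N, device=patches.device)
        mask[:, :num_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)
        loss = ((pred - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()


# Usage sketch: one pre-training step on 224x224 pages split into 16x16 patches.
model = TinyMAE(patch_dim=16 * 16 * 3, num_patches=(224 // 16) ** 2)
patches = torch.randn(4, 196, 16 * 16 * 3)  # stand-in for real document-image patches
loss = model(patches)
loss.backward()
```

After pre-training of this kind, the encoder can be fine-tuned with a classification head on DocCT-style labels; that fine-tuning step is likewise assumed here rather than taken from the paper.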
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond