DocCT: Shift Document Image Classification Research from Format to Content

Anonymous

16 Oct 2022 (modified: 05 May 2023)
ACL ARR 2022 October Blind Submission
Readers: Everyone
Abstract: Document image understanding is challenging because a document image combines illustrations and text in complex ways. Previous document image classification datasets and models focus more on document format than on meaningful content. In this paper, we introduce DocCT, a first-of-its-kind document image classification dataset that covers a variety of everyday topics and requires understanding fine-grained document content for correct classification. Further, since previous image models cannot sufficiently understand the semantic content of document images, we present DocMAE, a new self-supervised pre-trained document image model. Experiments show that DocMAE understands fine-grained content far better than previous models and even surpasses OCR-based models, which demonstrates that the semantics of document images can be understood well from pixels alone.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond