Visually Guided Generative Text-Layout Pre-training for Document Intelligence

16 Jun 2023 (modified: 22 Mar 2024) · Submitted to EMNLP 2023
Submission Type: Regular Long Paper
Submission Track: Language Grounding to Vision, Robotics and Beyond
Keywords: Multimodal Pre-training, Visual Document Understanding
Abstract: Prior studies show that pre-training techniques can boost the performance of visual document processing, which typically requires models to perceive and reason over both document texts and layouts (e.g., the locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given an input document image, the model optimizes hierarchical language and layout modeling objectives to generate a mixed target sequence of texts and layouts. ViTLP can thus function as a native OCR model that locates and recognizes texts in document images. In addition, to address the limitation of Transformers in processing long documents, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of any length. Experiments show that ViTLP achieves promising performance compared to existing pre-trained baselines on various visual document understanding (VDU) tasks, including information extraction, document classification, and visual document question answering.
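To make the "mixed target sequence of texts and layouts" concrete, below is a minimal sketch (not the authors' code) of how such a sequence might be constructed: each recognized word is followed by its bounding box, quantized into discrete location tokens. The token names (`<loc_*>`, `<bos>`, `<eos>`) and the bin count are illustrative assumptions, not ViTLP's actual vocabulary.

```python
# Hypothetical sketch of interleaving word tokens with quantized
# layout tokens, as in generative text-layout modeling.
from typing import List, Tuple

NUM_BINS = 1000  # assumed number of discrete coordinate bins


def quantize(coord: float, size: float) -> int:
    """Map an absolute page coordinate to a discrete bin in [0, NUM_BINS)."""
    return min(int(coord / size * NUM_BINS), NUM_BINS - 1)


def build_target_sequence(
    words: List[str],
    boxes: List[Tuple[float, float, float, float]],  # (x0, y0, x1, y1)
    page_w: float,
    page_h: float,
) -> List[str]:
    """Interleave each word with its quantized bounding-box tokens."""
    seq = ["<bos>"]
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        seq.append(word)
        seq += [
            f"<loc_{quantize(x0, page_w)}>",
            f"<loc_{quantize(y0, page_h)}>",
            f"<loc_{quantize(x1, page_w)}>",
            f"<loc_{quantize(y1, page_h)}>",
        ]
    seq.append("<eos>")
    return seq


# Example: two words on a 1000x800 page.
print(build_target_sequence(
    ["Invoice", "Total"],
    [(100, 50, 260, 90), (120, 700, 220, 740)],
    page_w=1000, page_h=800,
))
```

Under this kind of scheme, a decoder trained on such sequences naturally doubles as an OCR model, since generating the sequence amounts to both recognizing each word and predicting where it appears on the page.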
Submission Number: 4207