ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document UnderstandingDownload PDF

Anonymous

16 Jan 2022 (modified: 05 May 2023)ACL ARR 2022 January Blind SubmissionReaders: Everyone
Abstract: We propose ERNIE-Layout, a knowledge enhanced pre-training approach for visual document understanding, which incorporates layout-knowledge into the pre-training of visual document understanding to learn a better joint multi-modal representation of text, layout and image. Previous works directly model serialized tokens from documents according to a raster-scan order, neglecting the importance of the reading order of documents, leading to sub-optimal performance. We incorporate layout-knowledge from Document-Parser into document pre-training, which is used to rearrange the tokens following an order more consistent with human reading habits. And we propose the Reading Order Prediction (ROP) task to enhance the interactions within segments and correlation between segments and a fine-grained cross-modal alignment pre-training task named Replaced Regions Prediction (RRP). ERNIE-Layout attempts to fuse textual and visual features in a unified Transformer model, which is based on our newly proposed spatial-aware disentangled attention mechanism. ERNIE-Layout achieves superior performance on various document understanding tasks, setting new SOTA for four tasks, including information extraction, document classification, document question answering.
Paper Type: long
0 Replies

Loading