Perception-Enhanced Generative Transformer for Key Information Extraction from Documents

Published: 2024, Last Modified: 10 Jan 2025. ICPR (31) 2024. License: CC BY-SA 4.0.
Abstract: Key information extraction (KIE) from scanned documents has attracted significant attention due to its practical real-world applications. Despite the impressive results achieved by incorporating multimodal information within a generative framework, existing methods fail to understand the complex layouts and fuzzy semantics of document images. To address these issues, we propose a perception-enhanced generative transformer (PEGT), which improves the model through fine-grained multimodal modeling and pre-training tasks tailored to the generative framework. First, we introduce a pre-trained vision-language model to provide transferable knowledge for visual text perception. Then, two auxiliary pre-training tasks, absolute position prediction (APP) and semantic relationship reasoning (SRR), are designed for the generative framework. APP learns to predict which grid cells the texts fall into, improving the model's utilization of position information. SRR exploits prior information about semantic relationships, endowing PEGT with better semantic discrimination. Finally, well-designed prompts are leveraged to unleash the potential of PEGT for extracting key information from documents. Extensive experiments on several public datasets show that PEGT generalizes effectively over different types of documents. In particular, PEGT achieves state-of-the-art F-measures of 97.47%, 98.04%, and 84.32% on the SROIE, CORD, and FUNSD datasets, respectively, demonstrating the superiority of the proposed method.
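The APP objective described above requires a target label for each text box: the index of the grid cell its position falls into. A minimal sketch of how such targets could be constructed is shown below; the grid size, the use of the box center, and the flattened cell-id encoding are all assumptions for illustration, not details taken from the paper.

```python
def grid_cell(bbox, page_w, page_h, grid=7):
    """Map a text bounding box to the index of the grid cell containing
    its center. Illustrative sketch of an APP-style position target:
    the page is split into a grid x grid lattice, and the label is the
    flattened index of the cell the box center falls into.
    """
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 / page_w   # normalized center x in [0, 1]
    cy = (y0 + y1) / 2 / page_h   # normalized center y in [0, 1]
    col = min(int(cx * grid), grid - 1)  # clamp boxes touching the edge
    row = min(int(cy * grid), grid - 1)
    return row * grid + col       # flattened cell id in [0, grid*grid)
```

For example, a box near the top-left corner of the page maps to cell 0, while one near the bottom-right maps to the last cell; the model is then trained to predict this cell id from the token's multimodal features.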