Region-Level Layout Generation for Multi-level Pre-trained Model Based Visual Information Extraction

Published: 01 Jan 2024, Last Modified: 11 Apr 2025 · ICPR 2024 · CC BY-SA 4.0
Abstract: Multimodal pre-trained models have made significant advances in visual information extraction by jointly modeling the textual, layout, and visual modalities, among which layout information plays a key role in capturing the inherent structure of documents. However, due to the diversity and complexity of document types and typography styles, how to model various document layouts comprehensively and hierarchically remains underexplored. Compared with the single-level layout adopted by most previous works, multi-level layouts, including word-level, segment-level, and region-level layouts, provide a more principled modeling of complex document structures. Since most existing OCR tools lack high-quality region-level layout outputs, which hinders the utilization of multi-level layout information, we propose ReMe, a region-level layout generation method based on hierarchical clustering. By iteratively clustering and merging segment-level bounding boxes, ReMe ensures that semantically related segments with strong correlations share the same region-level bounding box. ReMe can be seamlessly integrated into existing multi-level layout information modeling methods at negligible cost. Experimental results show that, after pre-training on only 2 million documents from the IIT-CDIP dataset, the model achieves new state-of-the-art results on downstream visual information extraction datasets, and the region-level layout information generated by ReMe significantly enhances the model's understanding of structured documents, especially on the Relation Extraction task.
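To make the iterative clustering step concrete, below is a minimal, hypothetical Python sketch of the merge loop the abstract describes: segment-level bounding boxes are repeatedly merged into larger boxes until no pair remains sufficiently close, and the survivors serve as region-level boxes. The gap-based merge criterion and the `threshold` parameter are assumptions for illustration only; the actual ReMe method may use a different measure of correlation between segments.

```python
# Hypothetical sketch of ReMe-style region-level box generation.
# The spatial-gap merge criterion below is an assumption; the paper's
# abstract only states that segment boxes are iteratively clustered
# and merged into shared region-level boxes.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def box_gap(a: Box, b: Box) -> float:
    """Smallest axis-aligned gap between two boxes (0 if they overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)
    return (dx ** 2 + dy ** 2) ** 0.5

def merge(a: Box, b: Box) -> Box:
    """Union bounding box covering both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def cluster_regions(segments: List[Box], threshold: float) -> List[Box]:
    """Iteratively merge the closest pair of boxes until every pairwise
    gap exceeds the threshold; the surviving boxes are region-level."""
    boxes = list(segments)
    while len(boxes) > 1:
        best = None  # (gap, i, j) for the closest pair found so far
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                g = box_gap(boxes[i], boxes[j])
                if best is None or g < best[0]:
                    best = (g, i, j)
        if best[0] > threshold:
            break  # no pair is close enough to merge
        _, i, j = best
        merged = merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)]
        boxes.append(merged)
    return boxes

if __name__ == "__main__":
    # Two adjacent text lines plus one distant footer segment.
    segs = [(10, 10, 200, 30), (10, 35, 180, 55), (10, 500, 120, 520)]
    print(cluster_regions(segs, threshold=15.0))
    # The two nearby lines merge into one region; the footer stays separate.
```

In this sketch, nearby segments collapse into a single region box while distant ones remain apart, mirroring the abstract's goal of grouping strongly correlated segments under a shared region-level bounding box.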