Boosting Document Layout Analysis with Graphic Multi-modal Data Fusion and Spatial Geometric Transformation
Keywords: Multi-modal Data Fusion, Document Intelligence
Abstract: Document layout analysis is essential for Document Intelligence, playing a pivotal role in the automated understanding and processing of document content. Most existing approaches in this domain rely on computer vision techniques that concentrate on the image modality, even though documents contain both rich visual and textual information. While recent multi-modal approaches have begun to incorporate word embeddings to enhance recognition capabilities, they also incur a substantial computational burden. Moreover, the diversity of document structures demands highly robust models, especially during the document editing process. In this paper, we introduce pluggable and efficient data pre-processing strategies to boost layout analysis performance. First, we observe that element categories depend on the relative relationships between elements, and we propose a Graphical Multi-modal Data Fusion technique, which constructs a graph to establish connections between disparate textual segments. Second, to address the structural diversity of documents, we devise a Spatial Geometric Transformation strategy to improve model robustness against layout alterations. Our methods operate during the pre-processing phase, which allows straightforward integration with existing models and yields a significant accuracy increase at negligible extra computational cost. Experimental results show that our strategies achieve state-of-the-art performance across multiple document layout analysis datasets. We will make the code publicly available shortly.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8525
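The abstract describes two pre-processing ideas without giving details: a graph that connects textual segments, and geometric transformations of the layout. The sketch below is only an illustration of what such steps could look like, not the authors' method. It assumes each text segment is given as an axis-aligned box `(x0, y0, x1, y1)`, links each segment to its `k` nearest neighbours by centre distance, and applies a simple scale-and-translate transformation to the boxes. The names `Box`, `knn_edges`, and `scale_translate` are hypothetical.

```python
import math
from typing import List, Tuple

# Assumption: a text segment is represented by an axis-aligned box (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the centre point of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def knn_edges(boxes: List[Box], k: int = 2) -> List[Tuple[int, int]]:
    """Directed edges (i, j) where j is one of the k segments nearest to i.

    Nearest-neighbour linking by centre distance is an assumption made for
    this sketch; the abstract does not say how the graph is built.
    """
    centers = [center(b) for b in boxes]
    edges: List[Tuple[int, int]] = []
    for i, (cx, cy) in enumerate(centers):
        # Sort all other segments by Euclidean distance between centres.
        dists = sorted(
            (math.hypot(cx - ox, cy - oy), j)
            for j, (ox, oy) in enumerate(centers)
            if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges


def scale_translate(
    boxes: List[Box], sx: float, sy: float, dx: float, dy: float
) -> List[Box]:
    """Scale every box by (sx, sy) and then shift it by (dx, dy).

    This is one simple instance of a spatial geometric transformation,
    chosen here for illustration only.
    """
    return [
        (x0 * sx + dx, y0 * sy + dy, x1 * sx + dx, y1 * sy + dy)
        for x0, y0, x1, y1 in boxes
    ]


if __name__ == "__main__":
    segments = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 100, 10, 110)]
    print(knn_edges(segments, k=1))
    print(scale_translate(segments[:1], sx=2, sy=1, dx=5, dy=0))
```

Because both functions act only on box coordinates, a step of this kind can run before any model sees the data, which matches the abstract's claim that the strategies are pluggable and add little computation.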