Light-Weight Multi-modality Feature Fusion Network for Visually-Rich Document Understanding

Jeff Yang, Huynh The Vu, Hai Luu Tuan

Published: 2024, Last Modified: 27 May 2026, ICDAR (1) 2024, CC BY-SA 4.0
Abstract: Entity extraction (EE) is an important task in visually-rich document understanding (VrDU) that leverages multi-modal features of text, layout, and image. Recent transformer-based architectures fuse these features effectively and perform strongly on the EE task. However, these models are heavy, which leads to high training cost and slow inference. We therefore propose a light-weight transformer-based model, named LMFFN, with a novel layout-self-attention, layout-aware multi-modal fusion mechanism that enables efficient entity extraction. Specifically, the proposed framework uses only a simple pre-training objective coupled with an effective batch implementation. In addition, it imposes no constraints on the input sequence length or the reading order. This relaxation gives our model an advantage on camera-captured and skewed documents: we observed a 7% F1-score improvement over previous SOTA models on camera data. Evaluation results on three public datasets (CORD, SROIE, and XFUND) show that the proposed architecture achieves performance competitive with recent SOTA models while having 5 to 10 times fewer parameters.