Abstract: The relative position of text blocks plays a crucial role in document understanding. However, embedding layout information in the representation of a page instance is not a trivial task. Computer Vision and Natural Language Processing techniques have advanced content extraction from document images by taking layout features into account. We propose a set of Layout Quadrant Tags (LayoutQT), a new way of encoding layout information in textual embeddings. We show that this significantly enhances a standard NLP pipeline without requiring expensive mid- or high-level multimodal fusion. Since our goal is a solution with low computational cost, we centered our experiments on the AWD-LSTM neural network. We evaluated our method on page stream segmentation and document classification tasks with two datasets, Tobacco800 and RVL-CDIP. On the former, our method improved the F1 score from 97.9% to 99.1%, and on the latter the F1 score went from 80.4% to 83.6%. Similar performance gains were obtained when we applied LayoutQT with BERT.
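The abstract describes tagging text with quadrant positions before feeding it to a standard NLP pipeline. As a rough illustration only (the paper's exact tag vocabulary and grid are not given here), the sketch below assumes a 2x2 partition of the page and hypothetical `<q0>`..`<q3>` tags assigned from each text block's bounding-box center:

```python
def layout_quadrant_tag(bbox, page_width, page_height, grid=2):
    """Map a text block's bounding box to a quadrant tag.

    bbox = (x0, y0, x1, y1) in page coordinates; the block's center
    decides which cell of a grid x grid partition it falls in.
    Tag names <q0>..<q3> are illustrative, not the paper's vocabulary.
    """
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    col = min(int(cx / page_width * grid), grid - 1)
    row = min(int(cy / page_height * grid), grid - 1)
    return f"<q{row * grid + col}>"


def tag_blocks(blocks, page_width, page_height):
    """Prefix each OCR text block with its quadrant tag and join them,
    producing a plain token stream a text-only model can consume."""
    return " ".join(
        layout_quadrant_tag(b["bbox"], page_width, page_height) + " " + b["text"]
        for b in blocks
    )


# Example on a US Letter page (612 x 792 points):
blocks = [
    {"bbox": (50, 40, 300, 60), "text": "Invoice No. 42"},    # top-left
    {"bbox": (400, 700, 600, 780), "text": "Total: $10.00"},  # bottom-right
]
print(tag_blocks(blocks, 612, 792))
```

The point of this style of encoding is that the layout signal rides along inside the token sequence itself, so an off-the-shelf text model (e.g. AWD-LSTM or BERT) needs no architectural change to exploit it.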