Abstract: The relative position of text blocks plays a crucial role in document understanding. However, embedding layout information in the representation of a page instance is not a trivial task. Computer Vision and Natural Language Processing techniques have advanced content extraction from document images by taking layout features into account. We propose a set of Layout Quadrant Tags (LayoutQT), a new way of encoding layout information in textual embeddings. We show that this significantly enhances a standard NLP pipeline without requiring expensive mid- or high-level multimodal fusion. Since our goal is a solution with low computational cost, we centered our experiments on the AWD-LSTM neural network. We evaluated our method on page stream segmentation and document classification tasks with two datasets, Tobacco800 and RVL-CDIP. On the former, our method improved the F1 score from 97.9% to 99.1%, and on the latter the F1 score went from 80.4% to 83.6%. Similar performance gains were obtained when we applied LayoutQT with BERT.
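The abstract describes tagging text with quadrant positions before feeding it to a standard NLP pipeline. As a rough illustration only (the paper's exact tag vocabulary and grid are not given here), the sketch below assumes a 2x2 partition of the page and hypothetical `<q0>`..`<q3>` tags assigned from each text block's bounding-box center:

```python
def layout_quadrant_tag(bbox, page_width, page_height, grid=2):
    """Map a text block's bounding box to a quadrant tag.

    bbox = (x0, y0, x1, y1) in page coordinates; the block's center
    decides which cell of a grid x grid partition it falls in.
    Tag names <q0>..<q3> are illustrative, not the paper's vocabulary.
    """
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    col = min(int(cx / page_width * grid), grid - 1)
    row = min(int(cy / page_height * grid), grid - 1)
    return f"<q{row * grid + col}>"


def tag_blocks(blocks, page_width, page_height):
    """Prefix each OCR text block with its quadrant tag and join them,
    producing a plain token stream a text-only model can consume."""
    return " ".join(
        layout_quadrant_tag(b["bbox"], page_width, page_height) + " " + b["text"]
        for b in blocks
    )


# Example on a US Letter page (612 x 792 points):
blocks = [
    {"bbox": (50, 40, 300, 60), "text": "Invoice No. 42"},    # top-left
    {"bbox": (400, 700, 600, 780), "text": "Total: $10.00"},  # bottom-right
]
print(tag_blocks(blocks, 612, 792))
```

The point of this style of encoding is that the layout signal rides along inside the token sequence itself, so an off-the-shelf text model (e.g. AWD-LSTM or BERT) needs no architectural change to exploit it.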