LAMBERT: Layout-Aware Language Modeling for Information Extraction

Lukasz Garncarek, Rafal Powalski, Tomasz Stanislawek, Bartosz Topolski, Piotr Halama, Michal Turski, Filip Gralinski

Published: 2021, Last Modified: 13 May 2025, ICDAR (1) 2021, CC BY-SA 4.0

Abstract: We introduce a simple new approach to understanding documents in which non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture so that it can use layout features obtained from an OCR system without re-learning language semantics from scratch. We augment the model input only with the coordinates of token bounding boxes, thereby avoiding the use of raw images. This yields a layout-aware language model that can then be fine-tuned on downstream tasks.
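
The idea of augmenting the model input with bounding-box coordinates can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the coordinates are encoded, so the linear projection, the zero initialization, and every name below (`normalize_box`, `layout_embedding`, `augment`) are assumptions made for illustration. The zero-initialized projection is one plausible way to leave pretrained token embeddings untouched before fine-tuning.

```python
from typing import List, Sequence

Box = Sequence[float]  # hypothetical OCR box: (x1, y1, x2, y2) in pixels


def normalize_box(box: Box, page_width: float, page_height: float) -> List[float]:
    """Scale pixel coordinates to the [0, 1] range so they are page-size independent."""
    x1, y1, x2, y2 = box
    return [x1 / page_width, y1 / page_height, x2 / page_width, y2 / page_height]


def layout_embedding(norm_box: Sequence[float], projection: List[List[float]]) -> List[float]:
    """Map 4 coordinates to the hidden size with a 4 x hidden_size matrix (assumed encoding)."""
    hidden_size = len(projection[0])
    return [
        sum(norm_box[i] * projection[i][j] for i in range(4))
        for j in range(hidden_size)
    ]


def augment(
    token_embeddings: List[List[float]],
    boxes: List[Box],
    projection: List[List[float]],
    page_width: float,
    page_height: float,
) -> List[List[float]]:
    """Add a layout embedding to each token embedding; no image pixels are used."""
    augmented = []
    for embedding, box in zip(token_embeddings, boxes):
        layout = layout_embedding(normalize_box(box, page_width, page_height), projection)
        augmented.append([e + l for e, l in zip(embedding, layout)])
    return augmented


# Made-up stand-ins for pretrained token embeddings and OCR output.
token_embeddings = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
boxes = [(10, 20, 110, 40), (120, 20, 200, 40)]

# With a zero projection the augmented input equals the original embeddings,
# so the language model starts from its pretrained behavior.
zero_projection = [[0.0] * 3 for _ in range(4)]
augmented = augment(token_embeddings, boxes, zero_projection, 1000, 1000)
```

In this sketch only `projection` would be trained to carry layout information; the token embeddings themselves could come from any pretrained Transformer encoder.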