TAP-VL: Text Layout Aware Pretraining for Enriched Vision-Language Models

Jonathan Fhima, Elad Ben-Avraham, Oren Nuriel, Yair Kittenplon, Roy Ganz, Aviad Aberdam, Ron Litman

Published: ICCVW 2025 · Last Modified: 15 Mar 2026 · License: CC BY-SA 4.0
Abstract: Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images and prepend it to the textual inputs. The second strategy is OCR-free and relies on extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module that receives OCR output together with layout information and compresses it into a short fixed-length sequence, which serves as input to the LLM. To achieve this, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through short fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models across scene-text and document-based benchmarks.
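To make the abstract's core mechanism concrete, below is a minimal, hypothetical sketch of the idea it describes: OCR tokens enriched with bounding-box layout information are compressed by a small transformer with learnable queries into a short fixed-length sequence projected into the LLM's embedding space. This is not the authors' released code; all names (`OCRCompressor`), dimensions, and design details (e.g., using a `TransformerDecoder` for query cross-attention) are illustrative assumptions.

```python
# Hypothetical sketch of an OCR compression module, assuming a
# Q-Former-style design: learnable queries cross-attend to OCR token
# embeddings fused with layout (bounding-box) embeddings, yielding a
# fixed-length sequence of tokens for the LLM.
import torch
import torch.nn as nn


class OCRCompressor(nn.Module):
    """Compress variable-length OCR + layout input into K fixed tokens."""

    def __init__(self, vocab_size=30522, dim=512, num_queries=32,
                 num_layers=4, num_heads=8, llm_dim=4096):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # OCR word-piece ids
        self.layout_proj = nn.Linear(4, dim)                 # (x0, y0, x1, y1) boxes
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable queries
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.to_llm = nn.Linear(dim, llm_dim)                # project into LLM space

    def forward(self, ocr_ids, boxes):
        # ocr_ids: (B, N) OCR token ids; boxes: (B, N, 4) normalized coordinates
        ocr = self.token_emb(ocr_ids) + self.layout_proj(boxes)
        q = self.queries.unsqueeze(0).expand(ocr_ids.size(0), -1, -1)
        out = self.decoder(q, ocr)           # queries cross-attend to the OCR sequence
        return self.to_llm(out)              # (B, num_queries, llm_dim) tokens for the LLM


# Usage: a batch of 2 images with 100 OCR tokens each is reduced to 32 tokens.
compressor = OCRCompressor()
tokens = compressor(torch.randint(0, 30522, (2, 100)), torch.rand(2, 100, 4))
print(tokens.shape)  # torch.Size([2, 32, 4096])
```

The fixed output length is the key property: regardless of how much text an OCR engine extracts, the LLM always receives a short, constant-size prefix, which keeps inference cost bounded and lets the module be pretrained once and attached to different VL backbones.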