Keywords: Scene Text Understanding, Multimodal Large Language Models, Visual Grounding
TL;DR: VGST instruction-tunes multimodal LLMs using reverse text localization tasks and a reasoning dataset to improve spatial grounding for scene text. It significantly boosts text localization and recognition across multiple benchmarks.
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled strong performance on vision–language tasks, yet these models remain limited in spatial scene text understanding due to inadequate spatial grounding of text. In this work, we propose Visual Grounding for Scene Text (VGST), an instruction-tuning approach that improves MLLMs' fine-grained text localization and recognition in complex, cluttered scenes. Specifically, we introduce three reverse text localization tasks as instruction-tuning objectives, in which the model is guided to extract textual content from spatial localization cues, thereby strengthening its spatial grounding ability. To further enhance spatial text awareness, we curate a reasoning-centric dataset containing over 27,000 question–answer pairs spanning diverse real-world scenarios. We evaluate VGST on three benchmarks covering sparse to dense text distributions: SVT, Occluded RoadText, and HierText, where it consistently outperforms strong MLLM baselines. Specifically, VGST achieves relative improvements of 8.28%, 8.18%, and 27.3% in Character Recognition Rate (CRR) on the reverse text localization task; 5.48%, 5.2%, and 5.13% in recall for text localization; and 8.7%, 3.21%, and 2.45% in F1 score for end-to-end text recognition, respectively. A prompt sensitivity analysis shows that instruction tuning on a task with varied prompt formulations yields robust performance on that task, even when inference-time prompts differ from those seen during training. These results establish VGST as a reliable and effective solution for spatially aware scene text understanding in unconstrained real-world images. Our code and dataset are available here: https://anonymous.4open.science/r/VGST
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 20