Abstract: Image-text extraction is a critical research area in computer vision, with broad applications in document digitization and scene understanding. Traditional methods typically rely on specialized models designed for specific subtasks, which limits their generalizability across diverse scenarios. Recent progress in vision-language models has demonstrated significant potential for cross-scene text-related tasks. However, existing approaches often fail to adequately capture semantic features in text-related images and rely on rigid processing strategies, which struggle to adapt to the varying demands of textual scenes. To overcome these limitations, we introduce Text Large Language and Vision Assistant (T-LLaVA), an advanced vision-language model specifically optimized for text recognition. Specifically, we propose an effective activation mechanism that leverages text saliency and completeness to optimize processing routes, and a dynamic image slicing strategy that adapts to the spatial characteristics of input images. Extensive experiments show that T-LLaVA achieves competitive recognition accuracy across diverse scenarios, including natural scene text, cropped text, mathematical expressions, and document pages. These results validate the effectiveness and superiority of our proposed approach, highlighting its potential for robust and versatile text extraction in complex environments.
External IDs: dblp:conf/icdar/WeiYLZZY25