Separate, Locate, and Align: Determine Context Relation of Scene Text From Multiple Perspectives in TextVQA

Chengyang Fang, Wenhui Jiang, Yuming Fang, Yuxin Peng, Yang Liu

Published: 01 Nov 2025, Last Modified: 26 Jan 2026, IEEE Transactions on Circuits and Systems for Video Technology, CC BY-SA 4.0
Abstract: Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field use transformer-based models to model the interaction between the question and the scene texts, which means the scene texts are treated as a natural-language sentence and concatenated in reading order as part of the input. However, they ignore the fact that, unlike words in a natural-language sentence, which have inherent context relations, the context relations of scene texts in images need to be determined. To tackle this problem, we propose a novel method named Separate, Locate and Align (SLA) that discriminates the context relations of scene texts from semantic, visual, and spatial aspects. Specifically, based on the observation that scene texts with similar visual information (e.g., background color, font color, font style) have semantic contextual relations, we propose a Text Semantic Separate (TSS) module to discriminate the semantic relations between different scene texts according to their visual contextual information. Then, we introduce a Spatial Circle Position (SCP) module that helps the model discriminate the spatial relations between different scene texts. Last, we design a Visual Alignment (VA) module to help the model distinguish the visual relationships between different scene texts according to differences in their color distributions. Extensive experiments show that our method outperforms existing alternatives on the TextVQA and ST-VQA datasets without pre-training tasks.
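To illustrate the intuition behind the Text Semantic Separate idea, the following is a minimal, hypothetical sketch (not the authors' code): it decides whether two OCR tokens are contextually related by comparing simple visual attributes such as font and background color. The `OCRToken` class, the distance function, and the threshold are all illustrative assumptions.

```python
# Hypothetical sketch: grouping scene-text tokens by visual similarity,
# in the spirit of the TSS module described in the abstract.
from dataclasses import dataclass

@dataclass
class OCRToken:
    text: str
    font_color: tuple  # (R, G, B), assumed available from an OCR/vision pipeline
    bg_color: tuple    # background color behind the token, same assumption

def visual_distance(a: OCRToken, b: OCRToken) -> float:
    """Euclidean distance over concatenated color attributes."""
    feats_a = a.font_color + a.bg_color
    feats_b = b.font_color + b.bg_color
    return sum((x - y) ** 2 for x, y in zip(feats_a, feats_b)) ** 0.5

def same_context(a: OCRToken, b: OCRToken, threshold: float = 60.0) -> bool:
    # Tokens with similar font/background colors are assumed to share a
    # semantic context; the threshold is an illustrative choice.
    return visual_distance(a, b) <= threshold

# Example: a red "SALE" headline and its matching "50%" tag are visually
# close, while a dark license plate is not.
title = OCRToken("SALE", (255, 0, 0), (255, 255, 255))
price = OCRToken("50%", (250, 10, 5), (250, 250, 250))
plate = OCRToken("AB123", (0, 0, 0), (30, 30, 30))
print(same_context(title, price))  # True: visually similar tokens
print(same_context(title, plate))  # False: visually dissimilar tokens
```

In the paper's setting this kind of relation would be learned from visual features rather than hand-set thresholds; the sketch only shows why visual similarity can serve as a proxy for context.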