Segment then Match: Find the Carrier before Reasoning in Scene-Text VQA

Chengyang Fang, Liang Li, Jiapeng Liu, Bing Li, Dayong Hu, Can Ma

Published: 2024, Last Modified: 25 Mar 2026. ICASSP 2024. License: CC BY-SA 4.0.
Abstract: Text-based Visual Question Answering (TextVQA) requires models to answer questions about the scene text in images by reasoning about the context between the scene text and the question. Previous works demonstrated that clustering the scene text can help the model understand the context between different scene texts in the image. However, these methods cluster scene text solely based on spatial information, so nearby scene texts are grouped together even when they have no semantic contextual relationship. To solve this problem, we propose a Segment then Match method. Specifically, we propose an OCR-carrier Segmentation and Matching module that segments texts and carriers in the scene image and matches each OCR text to the carrier it belongs to. We then propose a Hierarchical Visual Feature Fusion module that facilitates semantic relevance judgment of OCR text from multiple visual perspectives, thereby aiding the answer reasoning process. Our proposed method outperforms state-of-the-art methods by 3.65% and 3.31% on the TextVQA and ST-VQA datasets, respectively. Extensive experiments validate the effectiveness of our method.