Abstract: Text-based Visual Question Answering (TextVQA) requires models to answer questions about the scene text in images by reasoning about the relationship between the scene text and the question. Previous works have demonstrated that clustering the scene text can help the model understand the context between different scene texts in an image. However, these methods cluster scene text based solely on spatial information, so spatially close scene texts are grouped together even when they share no semantic relationship. To address this problem, we propose a Segment-then-Match method. Specifically, we propose an OCR-carrier Segmentation and Matching module that segments texts and carriers in the scene image and matches each OCR text to the carrier it belongs to. We further propose a Hierarchical Visual Feature Fusion module that judges the semantic relevance of OCR texts from multiple visual perspectives, thereby aiding answer reasoning. Our proposed method outperforms state-of-the-art methods by 3.65% and 3.31% on the TextVQA and ST-VQA datasets, respectively. Extensive experiments validate the effectiveness of our method.
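To make the carrier-matching idea concrete, below is a minimal sketch (not the authors' implementation) of how each OCR text box could be assigned to the segmented carrier region that covers it most, rather than to whatever text happens to be nearby. The names `ocr_boxes`, `carrier_boxes`, the overlap measure, and the 0.5 threshold are illustrative assumptions, not details from the paper.

```python
# Sketch: assign each OCR token to its most-overlapping carrier region.
# Assumed interface; the paper's module operates on segmentation masks
# and learned features, not plain boxes.

from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def overlap_ratio(text: Box, carrier: Box) -> float:
    """Fraction of the OCR box's area covered by the carrier box."""
    ix1, iy1 = max(text[0], carrier[0]), max(text[1], carrier[1])
    ix2, iy2 = min(text[2], carrier[2]), min(text[3], carrier[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = max(1e-6, (text[2] - text[0]) * (text[3] - text[1]))
    return inter / area

def match_ocr_to_carriers(
    ocr_boxes: List[Box],
    carrier_boxes: List[Box],
    min_overlap: float = 0.5,  # assumed threshold, not from the paper
) -> List[Optional[int]]:
    """For each OCR box, return the index of its best carrier, or None."""
    matches: List[Optional[int]] = []
    for text in ocr_boxes:
        scores = [overlap_ratio(text, c) for c in carrier_boxes]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        if best is not None and scores[best] >= min_overlap:
            matches.append(best)
        else:
            matches.append(None)
    return matches

# Example: two OCR tokens, two carriers (e.g., a sign and a bottle label).
print(match_ocr_to_carriers(
    ocr_boxes=[(10, 10, 40, 20), (100, 50, 130, 60)],
    carrier_boxes=[(0, 0, 60, 40), (90, 40, 150, 80)],
))  # -> [0, 1]: tokens grouped by the carrier they sit on, not raw proximity
```

Grouping by carrier rather than by distance is what lets semantically unrelated but spatially adjacent texts (e.g., two neighboring signs) stay in separate groups.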
External IDs: dblp:conf/icassp/FangLLLHM24