MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: scene text retrieval, weakly-supervised learning, fine-grained image-text matching
TL;DR: We present MSTAR, a novel box-free method for multi-query scene text retrieval.
Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Moreover, they mostly adopt customized retrieval strategies and struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text Retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture multi-grained representations of text and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and $16k$ images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4\% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms previous models by an average of 8.5\%. The code and datasets are available at \href{https://github.com/yingift/MSTAR}{https://github.com/yingift/MSTAR}.
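The abstract only names the multi-instance matching module, so as a rough illustration of the general idea (not the paper's actual implementation, whose details live in the linked repository), here is a minimal PyTorch sketch of multi-instance retrieval scoring: each image contributes several text-instance embeddings, and the image is ranked by its best-matching instance against the query embedding. All function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def multi_instance_score(query_emb: torch.Tensor,
                         instance_embs: torch.Tensor) -> torch.Tensor:
    """Score one image against a text query via its text-instance embeddings.

    query_emb:     (d,)   embedding of the query string
    instance_embs: (n, d) embeddings of the n text instances found in the image

    Returns the max cosine similarity over instances, so an image is
    ranked by its best-matching text instance.
    """
    q = F.normalize(query_emb, dim=-1)
    inst = F.normalize(instance_embs, dim=-1)
    sims = inst @ q          # (n,) cosine similarities
    return sims.max()

def retrieve(query_emb: torch.Tensor, gallery):
    """Rank a gallery of (image_id, instance_embs) pairs for one query."""
    scores = [(img_id, multi_instance_score(query_emb, embs).item())
              for img_id, embs in gallery]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Under this sketch, supporting multiple query types would amount to producing `query_emb` from different query styles (e.g. exact word or phrase) with style-aware instructions, while the image-side scoring stays unchanged.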
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 20746