Abstract: Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
Lay Summary: Finding the right image or video by typing a short description, like “a woman skiing with a child”, is harder for computers than it seems. Visuals often include details that don’t neatly match the words we use.
Our method, called VISA, improves this by turning images and videos into clear text summaries using AI tools, and then refining those summaries with targeted follow-up questions drawn from the user's query. It doesn't require retraining, and can easily plug into existing systems to make search results more accurate and meaningful.
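To make the pipeline concrete, below is a minimal sketch of the test-time process the abstract describes. The helpers `caption_visual` and `answer_visual_question` are hypothetical stand-ins for any off-the-shelf captioning and VQA model (the abstract does not name specific ones), and the text retriever (sentence-transformers cosine similarity) is an illustrative choice, not the paper's prescribed scorer.

```python
# Sketch of the VISA test-time pipeline: (1) abstract each visual into a
# dense text description, (2) enrich it via question answering at the
# granularity the query asks about, (3) rank by text-to-text similarity.
# No model is retrained; everything runs at test time.
from sentence_transformers import SentenceTransformer, util

# Any text retriever works here; this model choice is an assumption.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")


def caption_visual(visual):
    """Hypothetical: return a dense text description from an
    off-the-shelf captioning model (image or video)."""
    raise NotImplementedError("plug in a captioning model")


def answer_visual_question(visual, question):
    """Hypothetical: answer a question about the visual with an
    off-the-shelf VQA model."""
    raise NotImplementedError("plug in a VQA model")


def visa_description(visual, query_questions):
    """Build the VISA description: a caption, then question-answer
    pairs that add the granularity the user's query requests.
    Deriving `query_questions` from the query (e.g., with an LLM)
    is left abstract here."""
    description = caption_visual(visual)
    for q in query_questions:
        description += f" {q} {answer_visual_question(visual, q)}"
    return description


def retrieve(query, visuals, query_questions):
    """Rank candidate visuals by cosine similarity between the query
    and their VISA text descriptions."""
    docs = [visa_description(v, query_questions) for v in visuals]
    q_emb = text_encoder.encode(query, convert_to_tensor=True)
    d_emb = text_encoder.encode(docs, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    return sorted(zip(visuals, scores.tolist()), key=lambda x: -x[1])
```

Because retrieval reduces to text-to-text matching over these descriptions, the sketch can wrap any existing retrieval system without modifying it, which is the plug-and-play property the abstract claims.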
Primary Area: Applications->Computer Vision
Keywords: text-to-image retrieval, text-to-video retrieval, plug-and-play
Submission Number: 681