Abstract: Image search with text feedback has promising impact in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from a user, the goal is to retrieve images that not only resemble the input image but also change certain aspects in accordance with the given text. This is a challenging task, as it requires a synergistic understanding of both image and text. In this work, we tackle this task with a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentivized to encapsulate multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search.
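To make the idea concrete, below is a minimal sketch (not the authors' released code) of what such a composite transformer block could look like: it takes CNN feature maps and a sentence embedding, attends over the joint visual-text tokens, and uses a learned gate to decide, per spatial location, how much to preserve the original visual feature versus adopt the text-conditioned transform. All module and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch of a text-conditioned "composite transformer" block;
# names and design details are assumptions, not the paper's exact module.
import torch
import torch.nn as nn

class CompositeTransformer(nn.Module):
    def __init__(self, visual_dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Project the sentence embedding into the visual feature space.
        self.text_proj = nn.Linear(text_dim, visual_dim)
        # Attention over the joint set of visual tokens plus the text token.
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        # Gate deciding, per location, how much to transform vs. preserve.
        self.gate = nn.Sequential(nn.Linear(visual_dim, visual_dim), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) CNN feature maps; text: (B, text_dim) embedding.
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C) visual tokens
        t = self.text_proj(text).unsqueeze(1)          # (B, 1, C) text token
        joint = torch.cat([tokens, t], dim=1)          # (B, H*W + 1, C)
        attended, _ = self.attn(tokens, joint, joint)  # text-conditioned transform
        g = self.gate(attended)                        # per-location gate in (0, 1)
        out = g * attended + (1.0 - g) * tokens        # transform vs. preserve
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Following the abstract, several such blocks would be inserted at different CNN depths so that both low-level and high-level visual features are modulated by the text, producing the multi-granular representation used for retrieval.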
We conduct a comprehensive evaluation on three datasets: Fashion200k, Shoes, and FashionIQ. Extensive experiments show that our model outperforms existing approaches on all three datasets, demonstrating consistent superiority in handling various forms of text feedback, including attribute-like and natural language descriptions.