Abstract: Image search with text feedback has promising impact in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from a user, the goal is to retrieve images that not only resemble the input image but also change certain aspects in accordance with the given text. This is a challenging task, as it requires a synergistic understanding of both image and text. In this work, we tackle this task with a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentivized to encapsulate multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search.
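To make the idea concrete, below is a minimal sketch (not the authors' released code) of what such a composite transformer block could look like: it takes CNN feature maps and a sentence embedding, attends over the joint visual-text tokens, and uses a learned gate to decide, per spatial location, how much to preserve the original visual feature versus adopt the text-conditioned transform. All module and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch of a text-conditioned "composite transformer" block;
# names and design details are assumptions, not the paper's exact module.
import torch
import torch.nn as nn

class CompositeTransformer(nn.Module):
    def __init__(self, visual_dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Project the sentence embedding into the visual feature space.
        self.text_proj = nn.Linear(text_dim, visual_dim)
        # Attention over the joint set of visual tokens plus the text token.
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        # Gate deciding, per location, how much to transform vs. preserve.
        self.gate = nn.Sequential(nn.Linear(visual_dim, visual_dim), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) CNN feature maps; text: (B, text_dim) embedding.
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C) visual tokens
        t = self.text_proj(text).unsqueeze(1)          # (B, 1, C) text token
        joint = torch.cat([tokens, t], dim=1)          # (B, H*W + 1, C)
        attended, _ = self.attn(tokens, joint, joint)  # text-conditioned transform
        g = self.gate(attended)                        # per-location gate in (0, 1)
        out = g * attended + (1.0 - g) * tokens        # transform vs. preserve
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Following the abstract, several such blocks would be inserted at different CNN depths so that both low-level and high-level visual features are modulated by the text, producing the multi-granular representation used for retrieval.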
We conduct a comprehensive evaluation on three datasets: Fashion200k, Shoes, and FashionIQ. Extensive experiments show that our model outperforms existing approaches on all three datasets, demonstrating consistent superiority in handling various forms of text feedback, including attribute-like and natural language descriptions.