Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification.
However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks require an embedding model whose output representation can be controlled through a natural language instruction. Existing CLIP-based approaches embed images and text independently and then fuse the results. We find that this yields weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, producing high-quality visual representations that can be controlled with natural language.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 5000