ABC: Achieving Better Control of Visual Embeddings using VLLMs

TMLR Paper5000 Authors

31 May 2025 (modified: 03 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. Such tasks require an embedding model whose output representation can be controlled by a natural language instruction. Existing CLIP-based approaches embed images and text independently and fuse the result. We find that this results in weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control.
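The abstract contrasts CLIP-style late fusion (independent image and text encoders whose outputs are combined afterward) with a VLM-style backbone that processes image and instruction tokens jointly. The sketch below is not the authors' implementation; it uses hypothetical dummy encoders and arbitrary dimensions purely to illustrate the structural difference between the two designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in encoders; dimensions are illustrative only.
class DummyImageEncoder(nn.Module):
    def __init__(self, d_in=768, dim=512):
        super().__init__()
        self.proj = nn.Linear(d_in, dim)

    def forward(self, image_feats):
        # Pool patch features into a single image vector.
        return self.proj(image_feats).mean(dim=1)

class DummyTextEncoder(nn.Module):
    def __init__(self, d_in=768, dim=512):
        super().__init__()
        self.proj = nn.Linear(d_in, dim)

    def forward(self, text_feats):
        # Pool token features into a single text vector.
        return self.proj(text_feats).mean(dim=1)

def clip_style_embedding(image_feats, text_feats, img_enc, txt_enc):
    """CLIP-style: each modality is encoded independently, then fused late
    (here, by averaging). The instruction never attends to image content."""
    img_vec = img_enc(image_feats)
    txt_vec = txt_enc(text_feats)
    fused = (img_vec + txt_vec) / 2  # late fusion: weak cross-modal interaction
    return F.normalize(fused, dim=-1)

class DummyVLMBackbone(nn.Module):
    """Hypothetical VLM-style backbone: image tokens and instruction tokens are
    processed by shared attention layers, so the instruction can reshape the
    visual representation before pooling."""
    def __init__(self, d_in=768, dim=512):
        super().__init__()
        self.proj = nn.Linear(d_in, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_feats, text_feats):
        tokens = torch.cat([self.proj(image_feats), self.proj(text_feats)], dim=1)
        joint = self.encoder(tokens)  # joint attention over both modalities
        return F.normalize(joint.mean(dim=1), dim=-1)

if __name__ == "__main__":
    B, n_img, n_txt, d_in = 2, 16, 8, 768
    image_feats = torch.randn(B, n_img, d_in)   # e.g. ViT patch features
    text_feats = torch.randn(B, n_txt, d_in)    # e.g. instruction token features

    clip_vec = clip_style_embedding(image_feats, text_feats,
                                    DummyImageEncoder(), DummyTextEncoder())
    vlm_vec = DummyVLMBackbone()(image_feats, text_feats)
    print(clip_vec.shape, vlm_vec.shape)  # both (2, 512); only the second is instruction-conditioned throughout
```

In the joint-attention variant, changing the instruction changes every attention step over the image tokens, which is the kind of natural-language control over the visual representation the abstract describes; in the late-fusion variant the image vector is fixed regardless of the instruction.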
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 5000