Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification.
However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks require an embedding model whose output representation can be controlled through a natural language instruction. Existing CLIP-based approaches embed images and text independently and then fuse the results. We find that this yields weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, producing high-quality visual representations that can be controlled with natural language.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 5000