SToRI: Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in Vision-Language Models

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone

Abstract: The text encoder in a Vision-Language Model (VLM) translates textual input into an embedding space shared with images, which makes it possible to analyze and interpret vision tasks through natural language. The significance of individual textual elements within a sentence varies with context and intended purpose, yet efforts to control the prominence of different textual information when constructing text embeddings have been lacking. This paper proposes SToRI (Semantic Token Reweighting for Interpretable and Controllable text embeddings), a framework that adds controllability to text embeddings while preserving their interpretability. SToRI refines the text encoding process in VLMs by weighting semantic elements differently according to their contextual importance, enabling finer control over emphasis in response to user preferences and data-driven insights. Comprehensive experiments demonstrate the efficacy of SToRI in image retrieval tailored to user preferences and in few-shot image classification.

Paper Type: long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Contribution Types: Model analysis & interpretability

Languages Studied: English
