SToRI: Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in Vision-Language Models
Abstract: The text encoder in a Vision-Language Model (VLM) translates textual input into an embedding space shared with images, which makes it possible to analyze vision tasks interpretably through natural language. Although the significance of different textual elements within a sentence varies with context and intended purpose, little work has addressed how to control the prominence of individual textual elements when constructing text embeddings. This paper proposes Semantic Token Reweighting (SToRI), a framework that makes text embeddings controllable while keeping them interpretable. SToRI refines the text encoding process in VLMs by weighting semantic elements differently according to their contextual importance, enabling finer control over emphasis in response to user preferences and data-driven insights. Comprehensive experiments demonstrate the efficacy of SToRI, showing its strength in image retrieval tailored to user preferences and its capability in few-shot image classification.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English
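The abstract describes weighting semantic elements differently during text encoding but does not state the mechanism. As a minimal sketch only, the example below assumes the weights act on attention scores: each token's unnormalized attention mass is multiplied by a positive weight before normalization. The function name `reweighted_attention` and all parameter names are hypothetical and are not taken from the paper.

```python
import numpy as np


def reweighted_attention(q, K, V, token_weights):
    """Single-query attention with per-token emphasis weights (illustrative).

    q: query vector, shape (d,)
    K: key matrix, shape (n, d)
    V: value matrix, shape (n, d_v)
    token_weights: positive weights, shape (n,); 1.0 leaves a token unchanged.

    Assumption: emphasis is applied by scaling exp(score) for each token,
    which is the same as adding log(weight) to the attention logits.
    """
    w = np.asarray(token_weights, dtype=float)
    if np.any(w <= 0):
        raise ValueError("token weights must be positive")
    logits = K @ q / np.sqrt(q.shape[0])   # scaled dot-product scores
    logits = logits + np.log(w)            # multiply exp(score) by w
    logits = logits - logits.max()         # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()            # renormalize to a distribution
    return probs @ V, probs


# Toy usage: emphasize the first of three tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))
_, base = reweighted_attention(q, K, V, [1.0, 1.0, 1.0])
_, boosted = reweighted_attention(q, K, V, [3.0, 1.0, 1.0])
```

With all weights equal to 1 this reduces to ordinary softmax attention, and raising one token's weight increases its share of the attention mass. The weights stay readable as per-token importances, which is the kind of interpretability and control the abstract refers to.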