SToRI: Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in Vision-Language Models

ACL ARR 2024 April Submission558 Authors

16 Apr 2024 (modified: 08 May 2024), ACL ARR 2024 April Submission, CC BY 4.0
Abstract: The text encoder in a Vision-Language Model (VLM) plays a crucial role in translating textual input into an embedding space shared with images, which facilitates the interpretable analysis of vision tasks through natural language. Although the importance of individual textual elements within a sentence varies with context, little effort has been made to account for this variation when constructing text embeddings. This paper proposes Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which also offers controllability. SToRI refines the text encoding process in VLMs by weighting semantic elements differently according to their contextual importance, allowing emphasis to be controlled in response to user preferences or data-driven insights. Comprehensive experiments demonstrate the efficacy of SToRI in image retrieval tailored to user preferences and in few-shot image classification.
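The abstract does not spell out how the reweighting is applied, so the following is only an illustrative sketch of one way per-token emphasis could enter an attention step, not the authors' implementation. It assumes a single query attending over a tokenized prompt and a hypothetical function name, `reweighted_attention`; the example prompt, logits, and weights are made up for illustration.

```python
import math

def reweighted_attention(scores, weights):
    """Softmax over attention logits with positive per-token emphasis weights.

    scores:  raw attention logits of one query over n tokens.
    weights: per-token emphasis; 1.0 leaves a token unchanged (assumed convention).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    # Adding log(w) to a logit multiplies its unnormalized probability by w.
    shifted = [s + math.log(w) for s, w in zip(scores, weights)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical prompt and weights: emphasize the token "red".
tokens = ["a", "photo", "of", "a", "red", "car"]
scores = [0.0] * len(tokens)               # uniform logits, for illustration only
weights = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0]
attn = reweighted_attention(scores, weights)
```

With uniform logits, a weight of 3.0 gives the emphasized token three times the attention mass of each other token, and setting every weight to 1.0 recovers the ordinary softmax. In this sketch the weights could be set by a user or fitted to data, which mirrors the two use cases the abstract names.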
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: interpretability, controllability, vision-language models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors decline to grant permission for ACL to publish peer reviewers' content
Submission Number: 558