Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

ACL ARR 2025 May Submission 5745 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: This paper proposes a method that enables a unimodal language agent to give preference feedback to a vision-language model (VLM), adapting the VLM's text generation to the agent's preferences. Using our method, we find that the VLM can supply multimodal scene descriptions that help the language agent better understand multimodal context. Our method improves absolute accuracy by more than 13% over the baseline multimodal approach. Extensive experiments provide insight into how and why the method works, as well as its limitations.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Learning, Preference Optimization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5745
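The page exposes no implementation details, so the sketch below is only one plausible instantiation of the setup suggested by the abstract and the "Preference Optimization" keyword: the unimodal agent (an LLM that never sees the image) ranks candidate VLM scene descriptions, and the preferred/rejected pair drives a DPO-style update of the VLM. The DPO objective and all helper names (`vlm.generate`, `agent_prefers`, `vlm_logprob`) are assumptions for illustration, not the submission's actual method or API.

```python
# Minimal sketch, not the authors' implementation: a DPO-style preference
# update where a unimodal language agent ranks two candidate VLM
# descriptions of the same image. All helpers are hypothetical placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed token log-probabilities."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def preference_step(vlm, ref_vlm, agent, image, prompt, optimizer, beta=0.1):
    # 1. Sample two candidate scene descriptions from the VLM
    #    (vlm.generate is a hypothetical sampling API).
    desc_a = vlm.generate(image, prompt)
    desc_b = vlm.generate(image, prompt)
    # 2. The unimodal agent picks the description it prefers for its task,
    #    without ever seeing the image (agent_prefers is hypothetical).
    if agent_prefers(agent, prompt, desc_a, desc_b):
        chosen, rejected = desc_a, desc_b
    else:
        chosen, rejected = desc_b, desc_a
    # 3. Tune the VLM toward the agent's preference with the DPO loss,
    #    using a frozen reference copy of the VLM (vlm_logprob is hypothetical).
    loss = dpo_loss(
        vlm_logprob(vlm, image, prompt, chosen),
        vlm_logprob(vlm, image, prompt, rejected),
        vlm_logprob(ref_vlm, image, prompt, chosen).detach(),
        vlm_logprob(ref_vlm, image, prompt, rejected).detach(),
        beta=beta,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `dpo_loss` function itself is standard and runnable on tensors of summed token log-probabilities; everything around it is a placeholder for whatever agent-feedback loop the paper actually uses.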