Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

ACL ARR 2025 May Submission 5745 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: This paper proposes a method that enables a unimodal language agent to give preference feedback to a vision-language model (VLM), adapting the VLM's text generation to the agent's preferences. Using our method, we find that the VLM can supply multimodal scene descriptions that help the language agent better understand multimodal context. Our method improves absolute accuracy by more than 13% over the baseline multimodal approach. Extensive experiments provide insight into how and why the method works, as well as its limitations.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Learning, Preference Optimization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5745
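The page exposes no implementation details, so the sketch below is only one plausible instantiation of the setup suggested by the abstract and the "Preference Optimization" keyword: the unimodal agent (an LLM that never sees the image) ranks candidate VLM scene descriptions, and the preferred/rejected pair drives a DPO-style update of the VLM. The DPO objective and all helper names (`vlm.generate`, `agent_prefers`, `vlm_logprob`) are assumptions for illustration, not the submission's actual method or API.

```python
# Minimal sketch, not the authors' implementation: a DPO-style preference
# update where a unimodal language agent ranks two candidate VLM
# descriptions of the same image. All helpers are hypothetical placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed token log-probabilities."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def preference_step(vlm, ref_vlm, agent, image, prompt, optimizer, beta=0.1):
    # 1. Sample two candidate scene descriptions from the VLM
    #    (vlm.generate is a hypothetical sampling API).
    desc_a = vlm.generate(image, prompt)
    desc_b = vlm.generate(image, prompt)
    # 2. The unimodal agent picks the description it prefers for its task,
    #    without ever seeing the image (agent_prefers is hypothetical).
    if agent_prefers(agent, prompt, desc_a, desc_b):
        chosen, rejected = desc_a, desc_b
    else:
        chosen, rejected = desc_b, desc_a
    # 3. Tune the VLM toward the agent's preference with the DPO loss,
    #    using a frozen reference copy of the VLM (vlm_logprob is hypothetical).
    loss = dpo_loss(
        vlm_logprob(vlm, image, prompt, chosen),
        vlm_logprob(vlm, image, prompt, rejected),
        vlm_logprob(ref_vlm, image, prompt, chosen).detach(),
        vlm_logprob(ref_vlm, image, prompt, rejected).detach(),
        beta=beta,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `dpo_loss` function itself is standard and runnable on tensors of summed token log-probabilities; everything around it is a placeholder for whatever agent-feedback loop the paper actually uses.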