If I understand the context, I will act accordingly: Combining Complementary Information with Generative Visual Language Models

ACL ARR 2024 June Submission1731 Authors

14 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The effectiveness of autoregressive LLMs has allowed many language and vision tasks to be reframed as generative problems. Generative visual language models (VLMs) have recently shown potential across various downstream tasks. However, it is still an open question whether, and to what extent, these models can properly understand a multimodal context where language and vision provide complementary information, a mechanism routinely in place in human language communication. In this work, we test various VLMs on the task of generating action descriptions consistent with both an image's visual content and an intention or attitude (not visually grounded) conveyed by a textual prompt. Our results show that BLIP-2 is not far from human performance when the task is framed as a generative multiple-choice problem, while other models struggle. Furthermore, the actions generated by BLIP-2 in an open-ended generative setting are better than those produced by the other models; indeed, human annotators judge most of them to be plausible continuations of the multimodal context. Our study reveals substantial variability among VLMs in integrating complementary multimodal information, yet BLIP-2 demonstrates promising trends across most evaluations, paving the way toward more seamless human-computer interaction.
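To make the multiple-choice framing in the abstract concrete, the sketch below ranks candidate action continuations for an image plus a textual intention using BLIP-2 through Hugging Face Transformers. The checkpoint name, image path, prompt wording, candidate continuations, and the likelihood-based scoring are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hedged sketch: rank candidate action descriptions for an image + textual
# intention with BLIP-2 (Hugging Face Transformers). Checkpoint, prompt, and
# candidates are illustrative assumptions, not the paper's setup.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b"
).to(device).eval()

image = Image.open("scene.jpg")  # hypothetical image, e.g. a person at a station
# Textual context conveying an intention that is not visually grounded.
context = "She wants to catch the train that is about to leave, so she"
candidates = [
    " runs down the stairs toward the platform.",
    " sits down at the cafe and orders a coffee.",
]

scores = []
for cand in candidates:
    inputs = processor(images=image, text=context + cand,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        # Using the text tokens as labels gives the average cross-entropy over
        # the sequence; a lower loss means the model prefers this continuation.
        # (A finer protocol would score only the candidate tokens.)
        out = model(**inputs, labels=inputs["input_ids"])
    scores.append(-out.loss.item())

best = max(range(len(candidates)), key=scores.__getitem__)
print("Model choice:", candidates[best].strip())
```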
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: language and vision, complementary, semantics, communication, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1731