Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment

Amin Memarian; Touraj Laleh; Irina Rish; Ardavan S. Nobandegani

Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment

Amin Memarian, Touraj Laleh, Irina Rish, Ardavan S. Nobandegani

Published: 17 Jun 2024, Last Modified: 02 Jul 2024ICML 2024 Workshop MHFAIA PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: vision-language models, multimodal alignment

Abstract: Generative AI systems (ChatGPT, Llama, etc.) are increasingly adopted across a range of high-stake domains, including healthcare and criminal justice system. This rapid adoption indeed raises moral and ethical concerns. The emerging field of AI alignment aims to make AI systems that respect human values. In this work, we focus on evaluating the ethics of multimodal AI systems involving both text and images --- a relatively under-explored area, as most alignment work is currently focused on language models. Specifically, here we investigate whether the multimodal alignment problem (i.e., the problem of aligning a multimodal system) could be effectively reduced to the (text-based) unimodal alignment problem, wherein a language model would make a moral judgment purely based on a description of an image. Focusing on GPT-4 and LLaVA as two prominent examples of multimodal systems, here we demonstrate, rather surprisingly, that this reduction can be achieved with a relatively small loss in moral judgment performance in the case of LLaVa, and virtually no loss in the case of GPT-4.

Submission Number: 67

Loading