Abstract: Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 [7] manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP [52] and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.
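To make the reward described above concrete, the following is a minimal sketch of a CLIP cosine-similarity reward for a generated caption. It assumes the Hugging Face `transformers` CLIP interface and the `openai/clip-vit-base-patch32` checkpoint for illustration; the function name `clip_reward` is hypothetical and this is not the paper's exact implementation.

```python
# Hedged sketch (not the paper's exact code): reward a generated caption by its
# CLIP cosine similarity to the input image, as used for RL-based alignment.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_reward(image: Image.Image, caption: str) -> float:
    """Return the CLIP cosine similarity between an image and a caption."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Cosine similarity of the image and text embeddings serves as the reward signal.
    return F.cosine_similarity(image_emb, text_emb).item()
```

In this sketch, the scalar returned by `clip_reward` would play the role of the supervision-free reward: higher values indicate that the generated text better matches the image, and no paired (image, text) training data is required.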