Learning Inner Monologue and Its Utilization in Vision-Language Challenges

Published: 23 Oct 2023, Last Modified: 28 Nov 2023 · SoLaR Poster
Keywords: Large Language Models, Interpretability of LLMs, Language and Vision
TL;DR: Inspired by human cognition, we propose IMMO, which mimics inner monologue to improve the interpretability of multi-agent systems while maintaining competitive performance on two vision-language tasks.
Abstract: Inner monologue is an essential phenomenon for reasoning and insight mining in human cognition. In this work, we propose a novel approach for AI systems to simulate inner monologue. Specifically, we treat the communications between components in an LLM-centric system as inner monologue, and demonstrate that this inner monologue reasoning ability can be learned through supervised learning and reinforcement learning and then utilized to solve complex vision-language problems across different domains. Driven by the power of Large Language Models (LLMs), two prominent methods for vision-language tasks have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach offers low training costs and interpretability but is difficult to optimize end-to-end. The second approach achieves strong performance, but feature alignment usually requires large amounts of training data and lacks interpretability. With inner monologue simulation, our approach achieves competitive performance with less training data and promising interpretability when compared with state-of-the-art models on two popular tasks.
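The hybrid LLM-VLM integration described in the abstract can be pictured as a question-asking dialogue whose transcript is the "inner monologue." The Python sketch below is only an illustration of that idea, not the authors' released code: the function names `immo_style_dialogue`, `llm`, and `vlm`, the prompt format, and the round budget are all hypothetical assumptions introduced here for clarity.

```python
from typing import Callable

def immo_style_dialogue(
    question: str,
    image_caption: str,
    llm: Callable[[str], str],   # hypothetical text-only LLM interface
    vlm: Callable[[str], str],   # hypothetical VLM that answers visual questions about the image
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Run a hybrid LLM-VLM loop in which the exchanged messages form an
    inspectable 'inner monologue' transcript."""
    monologue: list[str] = [f"Initial caption: {image_caption}"]
    for _ in range(max_rounds):
        # The LLM reads the dialogue so far and either asks the VLM a
        # follow-up visual question or commits to a final answer.
        prompt = (
            f"Question: {question}\n"
            + "\n".join(monologue)
            + "\nReply with 'ASK: <visual question>' or 'ANSWER: <final answer>'."
        )
        reply = llm(prompt).strip()
        monologue.append(f"LLM: {reply}")
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip(), monologue
        # Otherwise the VLM grounds the follow-up question in the image.
        visual_answer = vlm(reply.removeprefix("ASK:").strip())
        monologue.append(f"VLM: {visual_answer}")
    # Fall back to a forced answer if the round budget is exhausted.
    return llm("Give the final answer now.\n" + "\n".join(monologue)), monologue
```

In this reading, the returned transcript is the interpretability artifact, while the supervised and reinforcement learning described in the abstract would be applied to the LLM's question-asking policy; that training is outside the scope of this sketch, which leaves the `llm` and `vlm` callables to the caller.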
Submission Number: 50