Keywords: visual reasoning, finetuning large language models, instruction-based learning
TL;DR: We propose Cola, a paradigm that uses a coordinative language model for visual reasoning. Cola coordinates multiple pretrained VLMs based on the visual context and the plausible answers they provide.
Abstract: Visual reasoning demands multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed, exhibiting excellent commonsense reasoning ability across various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods such as ensembling still struggle to combine these models with the higher-order communication they require. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a language model (LM) can serve as an efficient coordinator that leverages the distinct and complementary capabilities of multiple VLMs. Extensive experiments demonstrate that our finetuning variant, Cola-FT, achieves state-of-the-art performance on outside-knowledge VQA, visual entailment, and visual spatial reasoning tasks. Through systematic ablation studies and visualizations, we validate that a coordinator LM comprehends the instruction prompts and the separate functionalities of the VLMs, and then coordinates them to enable impressive visual reasoning capabilities.
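The coordination idea in the abstract can be sketched minimally in code: each VLM contributes a caption and a plausible answer, an instruction prompt assembles this visual context, and a coordinator LM reads the prompt to produce the final answer. Everything below (the stub VLMs, the prompt wording, and the toy LM) is a hypothetical illustration under assumed interfaces, not the paper's actual models or prompts.

```python
# Hypothetical sketch of LM-coordinated VLMs. The two "VLMs" and the
# "LM" are stand-in stubs; only the overall coordination flow mirrors
# the paradigm described in the abstract.

def vlm_a(image, question):
    # Stub for a captioning-oriented VLM: a caption plus a candidate answer.
    return {"caption": "a man riding a surfboard on a wave",
            "answer": "surfing"}

def vlm_b(image, question):
    # Stub for a VQA-oriented VLM with a different (possibly wrong) guess.
    return {"caption": "a person out on the ocean",
            "answer": "swimming"}

def build_coordinator_prompt(question, outputs):
    # Assemble the instruction prompt the coordinator LM reads:
    # the question, each expert's visual context, and its plausible answer.
    lines = [f"Question: {question}"]
    for name, out in outputs.items():
        lines.append(f"{name} sees: {out['caption']}; it answers: {out['answer']}")
    lines.append("Considering the context and candidate answers, the answer is:")
    return "\n".join(lines)

def coordinate(image, question, llm):
    # Query every VLM, then let the (finetuned or prompted) LM decide.
    outputs = {"VLM-A": vlm_a(image, question),
               "VLM-B": vlm_b(image, question)}
    return llm(build_coordinator_prompt(question, outputs))

def toy_llm(prompt):
    # Toy coordinator: naively trusts the first candidate answer it finds.
    # A real coordinator LM would weigh the context against each candidate.
    for line in prompt.splitlines():
        if "it answers:" in line:
            return line.split("it answers:")[1].strip()
    return ""

print(coordinate(None, "What is the man doing?", toy_llm))  # -> surfing
```

In the finetuned variant described above (Cola-FT), the coordinator LM would be trained on such prompts paired with ground-truth answers rather than using a fixed heuristic.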