Can Frozen Large Language Models Solve Visual Reasoning?

ACL ARR 2024 June Submission5094 Authors

16 Jun 2024 (modified: 22 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: We present ReasonLM, a simple framework that uses a pre-trained, frozen large language model (LLM) for visual reasoning tasks and achieves competitive performance on ACRE and MEWL. We demonstrate for the first time that a frozen LLM can serve as a task-agnostic reasoning machine for diverse reasoning tasks involving object recognition, causal induction, and relation modeling. ReasonLM does not rely on synthesizing symbolic programs or on self-supervised visual representation learning. Instead, it learns an object-centric, lightweight visual encoder from scratch. Through this simplified design, we investigate the design choices essential for strong visual reasoning performance. Code and models will be released.
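The abstract describes a pattern of pairing a small, trainable visual encoder with a frozen LLM. The sketch below illustrates that general pattern, not the authors' actual architecture: the encoder design, slot mechanism, model name (`gpt2`), and all sizes are assumptions for illustration; only the encoder and projection receive gradients while the LLM stays frozen.

```python
# Minimal sketch (assumptions throughout, not the ReasonLM implementation):
# a visual encoder trained from scratch produces object-like tokens that are
# projected into a frozen LLM's embedding space.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class FrozenLMReasonerSketch(nn.Module):
    def __init__(self, llm_name="gpt2", num_slots=8):
        super().__init__()
        # Pre-trained LLM, kept frozen: its parameters are never updated.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        for p in self.llm.parameters():
            p.requires_grad = False
        # Lightweight visual encoder trained from scratch (a crude stand-in
        # for an object-centric encoder; the paper's design may differ).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((num_slots, 1)),  # one feature per "slot"
        )
        # Project visual features into the LLM's token-embedding space.
        self.proj = nn.Linear(64, self.llm.config.hidden_size)

    def forward(self, images):
        # images: (B, 3, H, W) -> slot features: (B, num_slots, 64)
        feats = self.encoder(images).squeeze(-1).transpose(1, 2)
        vis_tokens = self.proj(feats)  # (B, num_slots, hidden_size)
        # Feed visual tokens as input embeddings; gradients flow only
        # through the encoder and projection, not the frozen LLM weights.
        return self.llm(inputs_embeds=vis_tokens).logits


# Usage: only encoder/projection parameters appear in the optimizer.
model = FrozenLMReasonerSketch()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
logits = model(torch.randn(2, 3, 64, 64))
```

Freezing the LLM and training only the encoder keeps the trainable parameter count small and tests whether the reasoning ability already resides in the pre-trained weights.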
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Visual Reasoning, Multimodality
Languages Studied: English
Submission Number: 5094