Enhancing LLM Reasoning via Vision-Augmented Prompting

Published: 25 Sept 2024, Last Modified: 06 Nov 2024 · NeurIPS 2024 spotlight · License: CC0 1.0
Keywords: Multimodal Large Language Models, Dual-Modality Reasoning
TL;DR: A dual-modality reasoning framework that enhances reasoning capabilities by incorporating self-synthesized visual and spatial information for problem-solving.
Abstract: Verbal and visual-spatial information processing are two critical subsystems that activate different brain regions and often collaborate in cognitive reasoning. Despite the rapid advancement of LLM-based reasoning, mainstream frameworks such as Chain-of-Thought (CoT) and its variants focus primarily on the verbal dimension, which limits their ability to tackle reasoning problems with visual and spatial clues. To bridge this gap, we propose a novel dual-modality reasoning framework called Vision-Augmented Prompting (VAP). Upon receiving a textual problem description, VAP automatically synthesizes an image from the visual and spatial clues by utilizing external drawing tools. Subsequently, VAP formulates a chain of thought in both modalities and iteratively refines the synthesized image. Finally, a conclusive reasoning scheme based on self-alignment is proposed for final result generation. Extensive experiments are conducted across four versatile tasks: solving geometry problems, Sudoku, time series prediction, and the travelling salesman problem. The results validate the superiority of VAP over existing LLM-based reasoning frameworks.
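The abstract describes a three-stage pipeline (image synthesis, dual-modality chain of thought with iterative image refinement, and self-aligned conclusive reasoning). The sketch below is a minimal structural illustration of that pipeline, not the authors' implementation: the `TextModel` and `DrawingTool` interfaces, the prompt strings, and the fixed step count are all hypothetical placeholders.

```python
from typing import List, Optional, Protocol


class TextModel(Protocol):
    """Hypothetical LLM interface (placeholder, not the paper's API)."""
    def generate(self, prompt: str, image_path: Optional[str] = None) -> str: ...


class DrawingTool(Protocol):
    """Hypothetical external drawing tool; render() returns an image path."""
    def render(self, commands: str) -> str: ...


def vision_augmented_prompting(problem: str, llm: TextModel,
                               drawer: DrawingTool, steps: int = 4) -> str:
    # 1. Synthesize an initial image from the visual/spatial clues in the text.
    image = drawer.render(llm.generate(f"Produce drawing commands for: {problem}"))

    # 2. Dual-modality chain of thought: reason over text + current image,
    #    then iteratively refine the synthesized image after each step.
    thoughts: List[str] = []
    for _ in range(steps):
        thought = llm.generate(
            f"Problem: {problem}\nSteps so far: {thoughts}\nNext reasoning step?",
            image_path=image,
        )
        thoughts.append(thought)
        image = drawer.render(
            llm.generate(f"Update the drawing commands to reflect: {thought}")
        )

    # 3. Conclusive reasoning via self-alignment: require the verbal chain and
    #    the final image to agree before committing to an answer.
    return llm.generate(
        f"Problem: {problem}\nReasoning: {thoughts}\n"
        "Check consistency with the final image and give the final answer.",
        image_path=image,
    )
```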
Supplementary Material: zip
Primary Area: Other (please use sparingly, only use the keyword field for more details)
Submission Number: 5110