Decomposed Prompting for Vision and Language Arithmetic Reasoning

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone
Abstract: Math problems that involve both vision and language are a challenging multimodal task: solving them requires integrating visual information, textual information, and strong numerical reasoning. While large language models (LLMs) have achieved impressive performance on text-only arithmetic word problems, we find that introducing visual data significantly increases the difficulty. Specifically, the compositional task of recognizing objects and then counting them is a formidable hurdle for large vision-language models (LVLMs). The dual demands of object recognition and arithmetic reasoning hinder LVLMs from excelling at both simultaneously. The commonly employed chain-of-thought (CoT) approach, designed for LLMs, proves ineffective when applied to this multimodal task. As an alternative to demonstration-based CoT, we propose a novel decomposed prompting approach that explicitly breaks the task into two stages. The first stage detects and enumerates the objects referenced in the mathematical problem. The second stage then uses the output of stage one to directly answer the posed question. Our results demonstrate that this approach leads to substantial performance improvements on established benchmarks for vision-and-language arithmetic problems. This breaks the chains of CoT, paving the way towards a multimodal breakdown approach.
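The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the submission's implementation: the `ask` callable is a hypothetical stand-in for whatever vision-language model is queried, the prompt wording is invented for this example, and passing the image again in stage two is an assumption the abstract does not confirm.

```python
from typing import Callable

# Hypothetical interface: sends (image bytes, prompt) to a vision-language
# model and returns its text reply. Not taken from the submission.
AskFn = Callable[[bytes, str], str]


def build_stage_one_prompt(question: str) -> str:
    """Stage one: detect and enumerate the objects the question refers to."""
    # Illustrative wording only; the paper's actual prompt is not shown here.
    return (
        "List every object in the image that is relevant to the question "
        "below, with a count for each kind of object. Do not answer the "
        "question.\n"
        f"Question: {question}"
    )


def build_stage_two_prompt(question: str, inventory: str) -> str:
    """Stage two: answer the question using the stage-one object list."""
    return (
        "Objects detected in the image:\n"
        f"{inventory}\n"
        "Using this list, answer the question with a single number.\n"
        f"Question: {question}"
    )


def decomposed_answer(image: bytes, question: str, ask: AskFn) -> str:
    """Run both stages in sequence and return the model's final reply."""
    inventory = ask(image, build_stage_one_prompt(question))
    # Assumption: the image is supplied again alongside the inventory.
    return ask(image, build_stage_two_prompt(question, inventory)).strip()
```

Keeping the model call behind a plain callable makes the stage ordering easy to check with a stub before wiring in a real model client.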
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A2: n/a
A3: yes
B: yes
B1: yes
B1 Elaboration For Yes Or No: sections 3 and 4
B2: n/a
B3: n/a
B4: no
B4 Elaboration For Yes Or No: CLEVR-Math is a synthetic dataset.
B5: n/a
B6: yes
B6 Elaboration For Yes Or No: section 4
C: no
C1: n/a
C2: yes
C2 Elaboration For Yes Or No: Section 4; we mention that the GPT models were used with temperature 0
C3: yes
C3 Elaboration For Yes Or No: section 5
C4: n/a
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: no
E1: n/a
