Decomposed Prompting for Vision and Language Arithmetic Reasoning

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone

Abstract: Math problems that involve both vision and language pose a challenging multimodal task: solving them requires integrating visual information and textual information with strong numerical reasoning. While large language models (LLMs) have achieved impressive performance on text-only arithmetic word problems, we find that introducing visual data significantly increases the difficulty. Specifically, the compositional task of recognizing objects and then counting them is a formidable hurdle for large vision-language models (LVLMs). The dual demands of object recognition and arithmetic reasoning hinder LVLMs from excelling at both simultaneously. The commonly employed chain-of-thought (CoT) approach, designed for LLMs, proves ineffective when applied to this multimodal task. As an alternative to the demonstration-based CoT method, we propose a novel decomposition prompting approach that explicitly breaks the task down into two stages. The first stage detects and enumerates the objects referenced in the math problem. The second stage then uses the output of the first stage to directly answer the posed question. Our results demonstrate that this approach leads to substantial performance improvements on established benchmarks for vision-and-language arithmetic problems. This breaks the chains of CoT, paving the way towards a multimodal decomposition approach.
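
The two-stage flow described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `ask_model` is a hypothetical callable standing in for any LVLM client, the prompt wording is invented for the example, and passing the image again in the second stage is a choice made here that the abstract does not specify.

```python
from typing import Callable

# Hypothetical interface (an assumption, not from the paper): a callable that
# takes an image reference and a text prompt and returns the model's text reply.
AskModel = Callable[[str, str], str]


def build_stage_one_prompt(question: str) -> str:
    """Stage one: ask only for the objects the question refers to, with counts."""
    return (
        "List every object in the image that is relevant to the question below, "
        "together with its count.\n"
        f"Question: {question}"
    )


def build_stage_two_prompt(question: str, inventory: str) -> str:
    """Stage two: answer the question using the stage-one enumeration."""
    return (
        f"Objects in the image:\n{inventory}\n"
        "Using these counts, answer the question.\n"
        f"Question: {question}"
    )


def decomposed_answer(ask_model: AskModel, image: str, question: str) -> str:
    """Run the two stages in sequence and return the final answer text."""
    # Stage one: object detection and enumeration.
    inventory = ask_model(image, build_stage_one_prompt(question))
    # Stage two: arithmetic reasoning over the enumerated objects.
    # Assumption: the image is passed again; a text-only second call would also fit.
    return ask_model(image, build_stage_two_prompt(question, inventory))
```

Keeping the two prompt builders separate from the model call makes it easy to swap in a different client or to inspect the intermediate enumeration when an answer is wrong.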

Paper Type: short

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English

Preprint Status: There is no non-anonymous preprint and we do not intend to release one.

A1: yes

A2: n/a

A3: yes

B: yes

B1: yes

B1 Elaboration For Yes Or No: sections 3 and 4

B2: n/a

B3: n/a

B4: no

B4 Elaboration For Yes Or No: Clevr-Math is a synthetic dataset.

B5: n/a

B6: yes

B6 Elaboration For Yes Or No: section 4

C: no

C1: n/a

C2: yes

C2 Elaboration For Yes Or No: Section 4, where we mention that the GPT models were used with temperature 0

C3: yes

C3 Elaboration For Yes Or No: section 5

C4: n/a

D: no

D1: n/a

D2: n/a

D3: n/a

D4: n/a

D5: n/a

E: no

E1: n/a
