Self-play through Computational Runtimes improves Chart Reasoning

ACL ARR 2025 February Submission 497 Authors

08 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Vision-language models (VLMs) achieve impressive zero-shot performance on multimodal reasoning tasks. Typically, the best reported performance is achieved with a zero- or few-shot prompt. We observe that asking the model to solve the same task through other routes, such as code generation, hurts performance. Furthermore, training sets are typically no longer useful for improving model performance through few-shot learning, because they were already used during training. Indeed, we observe that auto-prompting techniques such as DSPy \cite{khattab2023dspycompilingdeclarativelanguage}, when applied to training sets, do not produce few-shot examples that further improve validation performance. Moreover, when combined with program-of-thought prompting, performance degrades even further.

Our work overcomes these limitations by introducing a novel self-play programming interface that leverages the ability of VLMs to first generate code that decomposes a complex visual reasoning task into sub-tasks, and then to use themselves, or other models, as tools to solve the decomposed tasks. With our approach, DSPy no longer suffers performance drops when applied iteratively to training sets. Furthermore, it outperforms zero-shot baselines on difficult chart reasoning benchmarks. We report the performance of our approach on ChartQA, PlotQA, and ChartFC. This enables large models, such as Gemini or GPT, to autonomously learn how to use themselves as tools and to iteratively improve without additional data.
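To make the described loop concrete, below is a minimal sketch of how such a self-play interface could look: the model first writes a decomposition program, the program is executed in a runtime, and the resulting sub-tasks are answered by calling the model on itself. All names (`call_vlm`, `run_generated_code`, `Subtask`) are hypothetical placeholders for illustration, not the authors' actual interface.

```python
# Illustrative sketch only; the VLM API and code runtime are assumed stubs.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtask:
    """One decomposed step produced by the model-written program."""
    question: str       # sub-question about the chart
    needs_vision: bool  # whether answering requires looking at the image


def call_vlm(prompt: str, image_bytes: Optional[bytes] = None) -> str:
    """Placeholder for a call to the VLM (e.g. Gemini or GPT)."""
    raise NotImplementedError("Wire this to your model API of choice.")


def run_generated_code(code: str) -> List[Subtask]:
    """Placeholder for a sandboxed runtime that executes model-written code."""
    raise NotImplementedError("Execute the generated program in a sandbox.")


def solve_chart_question(image_bytes: bytes, question: str) -> str:
    # 1. Ask the VLM to write a program that decomposes the question.
    decomposition_code = call_vlm(
        f"Write Python that splits this chart question into sub-tasks: {question}",
        image_bytes,
    )

    # 2. Execute the generated program to obtain the sub-tasks.
    subtasks = run_generated_code(decomposition_code)

    # 3. Answer each sub-task, using the model itself as a tool.
    partial_answers = [
        call_vlm(t.question, image_bytes if t.needs_vision else None)
        for t in subtasks
    ]

    # 4. Aggregate partial answers into a final response.
    return call_vlm(
        "Combine these partial answers into one final answer: "
        + "; ".join(partial_answers)
    )
```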

Paper Type: Long
Research Area: Question Answering
Research Area Keywords: vision question answering, multimodality, reasoning, multimodal applications, code generation and understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 497