Vision-language models (VLMs) achieve impressive zero-shot performance on multimodal reasoning tasks. Typically, the best reported performance is obtained with a zero- or few-shot prompt. We observe that asking the model to take alternative routes to solving the same task, such as through code generation, hurts performance. Furthermore, training sets are typically no longer useful for improving model performance through few-shot learning, because the models have already been trained on them. Indeed, we observe that auto-prompting techniques such as DSPy \cite{khattab2023dspycompilingdeclarativelanguage}, when applied to training sets, do not produce few-shot examples that further improve validation performance; when combined with program-of-thought prompting, performance degrades even further.
Our work overcomes these limitations by introducing a novel self-play programming interface that leverages the ability of VLMs to first generate code that decomposes a complex visual reasoning task into sub-tasks, and then to use themselves, or other models, as tools to solve the decomposed sub-tasks. Our approach prevents the performance drops DSPy otherwise suffers when applied iteratively to training sets, and it outperforms zero-shot baselines on difficult chart reasoning benchmarks. We report the performance of our approach on ChartQA, PlotQA, and ChartFC. This enables large models, such as Gemini or GPT, to autonomously learn how to use themselves as tools and to improve iteratively without the need for additional data.
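To make the interface concrete, the snippet below is a minimal Python sketch of the self-play loop described above, not the exact implementation: the \texttt{ask\_vlm} stub, the prompt wording, and the \texttt{ask}/\texttt{result} conventions exposed to the generated code are illustrative assumptions.

\begin{verbatim}
from typing import Callable, Optional

def ask_vlm(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder for a single VLM call (e.g. Gemini or GPT).

    Replace with a real client; the signature is an assumption."""
    raise NotImplementedError("Wire this to your VLM provider.")

def solve_with_self_play(question: str, image_path: str,
                         vlm: Callable[..., str] = ask_vlm) -> str:
    # Step 1: the VLM writes Python that decomposes the question into
    # sub-questions and answers them via the ask(sub_question) tool.
    codegen_prompt = (
        "Write Python code that answers the chart question below by "
        "splitting it into simpler sub-questions. Call ask(sub_question) "
        "whenever the chart must be read, and store the final answer "
        "in a variable named result.\n"
        "Question: " + question
    )
    generated_code = vlm(codegen_prompt, image_path)

    # Step 2: execute the generated code, exposing the same VLM back to
    # it as a tool, so the model decomposes and solves its own sub-tasks.
    namespace = {"ask": lambda sub_q: vlm(sub_q, image_path)}
    exec(generated_code, namespace)  # sandboxing omitted for brevity
    return str(namespace.get("result", ""))
\end{verbatim}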