Learning Visual Concepts via Vision Language Programs

Published: 22 Sept 2025 · Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · License: CC BY 4.0
Keywords: visual reasoning, vision-language models, program synthesis
Abstract: Vision–language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning, leading to inconsistent or illogical outputs. Neuro-symbolic methods address this by inducing interpretable logical rules, yet they typically depend on domain-specific perception modules. We propose Vision Language Programs (VLPs), which combine the perceptual flexibility of VLMs with the systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, our approach leverages the model to produce structured visual descriptions that are compiled into symbolic programs. These programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct prompting, particularly on tasks requiring complex logical reasoning.
Submission Number: 378
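To make the described pipeline concrete, below is a minimal illustrative sketch of a vision language program. It is not the authors' implementation: the function names (`describe_image`, `vlp_rule`), the fact schema, and the example rule are all hypothetical, and the VLM call is stubbed out. The sketch only shows the shape of the approach stated in the abstract: a VLM emits structured visual descriptions, and an interpretable symbolic program reasons over them to produce a label and an explanation.

```python
"""Illustrative sketch only (assumed names and schema, not the paper's code)."""
from dataclasses import dataclass


@dataclass
class ObjectFact:
    """A structured visual description of one object, as a VLM might emit it."""
    name: str        # e.g. "dog"
    color: str       # e.g. "brown"
    position: str    # e.g. "left"


def describe_image(image_path: str) -> list[ObjectFact]:
    """Hypothetical perception step: prompt a VLM for structured object facts.

    In practice this would call a vision-language model with a prompt that
    constrains its output to a parseable schema; here we return a stub.
    """
    return [
        ObjectFact(name="dog", color="brown", position="left"),
        ObjectFact(name="ball", color="red", position="right"),
    ]


def vlp_rule(facts: list[ObjectFact]) -> tuple[bool, str]:
    """Hypothetical symbolic program: a logical rule over the extracted facts.

    Example concept: "there is a dog and every ball is red". The rule is
    executed outside the VLM, so it stays consistent with task constraints
    and yields a human-interpretable justification.
    """
    has_dog = any(f.name == "dog" for f in facts)
    balls_red = all(f.color == "red" for f in facts if f.name == "ball")
    label = has_dog and balls_red
    explanation = (
        f"dog present: {has_dog}; all balls red: {balls_red} -> concept holds: {label}"
    )
    return label, explanation


if __name__ == "__main__":
    facts = describe_image("example.jpg")
    label, explanation = vlp_rule(facts)
    print(label)        # True
    print(explanation)  # human-readable explanation of the decision
```

In this reading, the VLM handles flexible perception while the compiled rule handles systematic reasoning, which is why the abstract reports gains over direct prompting on tasks that require complex logical structure.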