A Neurosymbolic Agent System for Compositional Visual Reasoning

08 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Neurosymbolic System, Vision-Language Agent, Compositional Visual Reasoning
TL;DR: This paper presents a neurosymbolic approach to developing a vision-language agent for compositional visual reasoning.
Abstract: Advances in large language models (LLMs) and large vision models have fueled rapid progress in multi-modal vision-language reasoning. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neurosymbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning, with three novel features. First, VLAgent develops an interpretable, visualization-enhanced, two-stage neurosymbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (a symbolic program script) for each compositional visual reasoning task, using a pre-trained LLM with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine: it transforms the planning script into executable code grounded in the visual input (image or video), combining neural models and symbolic functions, and then performs the resulting sequence of actions for the compositional visual reasoning task. Second, to ensure the quality of mapping the logic plan to a sequence of executable instructions, VLAgent introduces the SS-parser, which checks the syntactic and semantic correctness of the planning script and detects and repairs logic errors in the LLM-generated plan before generating the executable program. Third, VLAgent introduces an execution verifier at critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example, using ensemble methods for critical visual reasoning steps and caption analysis for low-confidence compositional reasoning. Extensive experiments were conducted on six visual benchmarks, comparing VLAgent with a dozen SoTA visual reasoning models. The results show that VLAgent outperforms representative existing approaches to compositional visual reasoning while enabling self-interpretable visualization for human-in-the-loop debugging. Our code and runtime logs are available at https://anonymous.4open.science/r/VLAgent.
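
To make the plan-then-execute pipeline concrete, below is a minimal, self-contained Python sketch of the two stages the abstract describes: a front-end planner that emits a symbolic script, an SS-parser-style check of the script, and a back-end executor that maps each symbolic step to a (here stubbed) neural or symbolic module. All names (generate_plan, ss_parse, NEURAL_MODULES, execute) are hypothetical stand-ins under stated assumptions, not VLAgent's actual API; in a real system the stubs would be replaced by vision models and the verifier would re-check low-confidence steps.

    # Hypothetical sketch of a plan-then-execute neurosymbolic pipeline.
    # None of these names come from the VLAgent codebase; they illustrate the idea only.
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Step:
        module: str   # symbolic op, e.g. "locate" or "count"
        args: dict    # arguments; uppercase strings reference earlier step outputs
        output: str   # variable name this step binds

    def generate_plan(question: str) -> list[Step]:
        """Stage 1 (front-end): an LLM with few-shot chain-of-thought prompting
        would emit the symbolic program script; here we return a fixed toy plan."""
        return [
            Step("locate", {"object": "dog"}, "BOX0"),
            Step("count",  {"boxes": "BOX0"}, "ANS"),
        ]

    def ss_parse(plan: list[Step], known_modules: set[str]) -> list[Step]:
        """SS-parser analogue: check syntax/semantics of the script and drop steps
        that reference unknown modules or unbound variables (repair is omitted)."""
        bound: set[str] = set()
        checked: list[Step] = []
        for step in plan:
            refs = {v for v in step.args.values() if isinstance(v, str) and v.isupper()}
            if step.module in known_modules and refs <= bound:
                checked.append(step)
                bound.add(step.output)
        return checked

    # Stage 2 (back-end): each symbolic op maps to a neural model or symbolic
    # function; these lambdas are fake stand-ins for real vision modules.
    NEURAL_MODULES: dict[str, Callable[..., Any]] = {
        "locate": lambda image, object: [(10, 20, 50, 60)],   # fake bounding boxes
        "count":  lambda image, boxes: len(boxes),
    }

    def execute(plan: list[Step], image: Any) -> Any:
        """Run the checked script step by step, threading intermediate results
        through an environment of named variables."""
        env: dict[str, Any] = {}
        result = None
        for step in plan:
            args = {k: env.get(v, v) if isinstance(v, str) else v
                    for k, v in step.args.items()}
            result = NEURAL_MODULES[step.module](image, **args)
            # An execution-verifier analogue would sanity-check `result` here,
            # e.g. via an ensemble or caption analysis when confidence is low.
            env[step.output] = result
        return result

    plan = ss_parse(generate_plan("How many dogs are in the image?"), set(NEURAL_MODULES))
    print(execute(plan, image=None))   # -> 1

Keeping the plan as structured data rather than free-form code is what lets an SS-parser-style check validate module names and variable bindings before any visual model is ever invoked, which is the design choice the abstract attributes to VLAgent's second feature.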
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Submission Number: 3198