NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, MSLD 2026 Poster, Everyone, CC BY 4.0
Keywords: Neuro-Symbolic, Compositional Reasoning, Vision and Language
Abstract: **Problem.** Modern Vision-Language Models (VLMs) perform impressively across many domains, yet they often struggle with compositional reasoning: the ability to decompose and recombine concepts to solve novel problems. Neuro-symbolic approaches offer a promising direction, but they are typically constrained by crisp logical execution or fail to generalize to predicates beyond their training domain, which limits their flexibility.

**Approach.** We introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model combining the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. NePTune operates in a training-free manner, and its modular design decouples perception from reasoning; its differentiable composition operations nevertheless support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks across various domains, including adversarial referring expression settings, and demonstrate significant improvements over base models, as well as effective compositional generalization and adaptation in novel environments.

**Agentic Usage.** Modern AI agents are increasingly code-first: they plan, call tools, and write programs rather than produce a single monolithic text output. In this setting, NePTune provides a compositional visual reasoning interface. It not only grounds visual concepts, but its operators also allow those concepts to be composed inside executable Python programs. This turns reasoning into a structured tool that agents can invoke, inspect, and refine.
NePTune enables agents to decompose tasks, reuse intermediate results, and systematically combine perceptual grounding with symbolic reasoning. As a result, compositional reasoning over the visual modality becomes programmable, traceable, and compatible with broader agent pipelines.

**Results.** As shown in the table below, NePTune consistently outperforms direct backbone prompting on synthetic compositional reasoning, human-generated real-world question answering, and real-image grounding tasks. Gains are especially strong on composition-heavy benchmarks and under domain shift, where the symbolic execution layer substantially improves robustness over monolithic VLM inference. In real-world referring expression grounding, NePTune performs strongly zero-shot, and verification further improves results on Ref-Adv. Furthermore, our analysis of CLEVR-Humans results reveals the effect of NePTune's imperative reasoning, which handles linguistically diverse, multi-step questions better than purely declarative baselines. Because NePTune supports fine-tuning, a smaller backbone (1B) can be tuned to achieve performance comparable to the 8B model on Ref-GTA and Ref-Adv, indicating that NePTune is both effective in zero-shot use and practical for out-of-distribution domain adaptation.

| Model | CLEVR | CH | Ref-Adv | Ref-GTA | Ref | Puzzles | RPM | Avg. |
|------|------|------|------|------|------|------|------|------|
| InternVL2.5-8B | 90.25 | 85.95 | 76.13 | 6.95 | 27.00 | 52.00 | 47.00 | 55.04 |
| + NePTune | 92.65 ↑2.40 | 87.67 ↑1.72 | 78.08 ↑1.95 | 69.69 ↑62.74 | 91.00 ↑64.00 | 81.00 ↑29.00 | 80.00 ↑33.00 | **82.87 ↑27.83** |

**Conclusion.** NePTune shows that separating perception from reasoning, and symbolically composing concept uncertainty within executable Python programs, enables robust compositional vision-language reasoning and improves performance on complex compositional queries and under domain shift.
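
The abstract describes programs that mix imperative Python control flow with soft logic operators over VLM-generated uncertainty. The snippet below is a minimal illustrative sketch of that general idea, not NePTune's actual operators or API. It assumes product-style fuzzy semantics (AND as a product, OR as a noisy-OR, NOT as a complement), and the `Soft` class, the `soft_exists` and `ground` helpers, and the per-object scores are all hypothetical stand-ins for what a VLM might return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Soft:
    """A truth value in [0, 1], e.g. a VLM's confidence that a predicate holds."""
    p: float

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError(f"probability out of range: {self.p}")

    def __and__(self, other):      # soft AND: product t-norm (assumed semantics)
        return Soft(self.p * other.p)

    def __or__(self, other):       # soft OR: noisy-OR (assumed semantics)
        return Soft(1.0 - (1.0 - self.p) * (1.0 - other.p))

    def __invert__(self):          # soft NOT: complement
        return Soft(1.0 - self.p)


def soft_exists(values):
    """Soft existential quantifier: noisy-OR over all candidates."""
    result = Soft(0.0)
    for v in values:
        result = result | v
    return result


# Hypothetical per-object predicate confidences (made-up numbers).
scores = {
    "obj0": {"red": 0.90, "cube": 0.80},
    "obj1": {"red": 0.20, "cube": 0.95},
    "obj2": {"red": 0.85, "cube": 0.10},
}


def ground(predicates, scores):
    """Rank objects by the soft conjunction of the requested predicates."""
    ranked = {}
    for obj, preds in scores.items():      # ordinary imperative control flow
        value = Soft(1.0)
        for name in predicates:
            value = value & Soft(preds[name])
        ranked[obj] = value.p
    best = max(ranked, key=ranked.get)
    return best, ranked


# "the red cube": conjunction of two soft predicates per object
best, ranked = ground(["red", "cube"], scores)

# "is there a red object that is not a cube?"
exists_red_non_cube = soft_exists(
    Soft(p["red"]) & ~Soft(p["cube"]) for p in scores.values()
)
```

Because every operator here is built from products and differences, the composed score is differentiable with respect to the input confidences, which is the kind of property the abstract relies on when it mentions fine-tuning through composition operations.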
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 81