Keywords: MLLM agent, programming, self-reflection
Abstract: Despite rapid progress in vision–language models (VLMs), small VLMs still struggle to serve as effective agents capable of coherent multi-step tool use, especially in settings where fine-tuning is impractical due to data or cost constraints.
To address these limitations, we introduce Cross-Reflect, a training-free framework for reflection-guided trajectory optimization.
Cross-Reflect generates multiple candidate trajectories, applies structured reflection to critique and refine them, and performs cross-trajectory selection to identify the most reliable solution.
It is instantiated as an extension of the DSPy programming framework, which we augment with modular support for multimodal inputs.
Extensive experiments across static and dynamic knowledge-intensive VQA benchmarks demonstrate that Cross-Reflect consistently improves small VLMs by enabling flexible tool usage and trajectory-level self-reflection, achieving average relative improvements of 10.5% for proprietary models and 28.1% for open-source models over baseline methods.
Further analysis shows that our approach achieves comparable performance to methods requiring model fine-tuning, and even surpasses them in certain cases.
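The generate–reflect–select pipeline described in the abstract can be sketched in plain Python. All function names below (`vlm_generate`, `vlm_reflect`) are hypothetical stand-ins for VLM calls; the paper's actual prompts, tools, scoring criteria, and DSPy modules are not specified here, so this is only a minimal illustration of the control flow, not the authors' implementation.

```python
# Hypothetical stand-ins for VLM calls; in practice these would invoke
# a small VLM with tool access and a structured reflection prompt.
def vlm_generate(question: str, seed: int) -> dict:
    """Sketch: produce one candidate multi-step tool-use trajectory."""
    return {"answer": f"candidate-{seed}", "steps": seed + 1}

def vlm_reflect(question: str, traj: dict) -> dict:
    """Sketch: structured reflection critiques and refines a trajectory,
    attaching a self-assessed reliability score (placeholder heuristic)."""
    score = 1.0 / traj["steps"]  # assumption: shorter trajectories score higher
    return {**traj, "score": score}

def cross_reflect(question: str, n_candidates: int = 3) -> dict:
    """Generate several trajectories, reflect on each, then perform
    cross-trajectory selection (here: argmax over reflection scores)."""
    candidates = [vlm_generate(question, s) for s in range(n_candidates)]
    reflected = [vlm_reflect(question, t) for t in candidates]
    return max(reflected, key=lambda t: t["score"])

best = cross_reflect("Which landmark is shown in the image?")
print(best["answer"])  # → candidate-0
```

The key design point the sketch captures is that selection operates across whole trajectories rather than individual steps, so the framework needs no gradient updates and remains training-free.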
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2806