Keywords: MLLM agent, programming, self-reflection
Abstract: Despite rapid progress in vision–language models (VLMs), small VLMs still struggle to serve as effective agents capable of coherent multi-step tool use, especially in settings where fine-tuning is impractical due to data or cost constraints.
To address these limitations, we introduce Cross-Reflect, a training-free framework for reflection-guided trajectory optimization.
Cross-Reflect generates multiple candidate trajectories, applies structured reflection to critique and refine them, and performs cross-trajectory selection to identify the most reliable solution.
It is instantiated as an extension of the DSPy programming framework, which we augment with modular support for multimodal inputs.
Extensive experiments across static and dynamic knowledge-intensive VQA benchmarks demonstrate that Cross-Reflect consistently improves small VLMs by enabling flexible tool usage and trajectory-level self-reflection, achieving average relative improvements of 10.5% for proprietary models and 28.1% for open-source models over baseline methods.
Further analysis shows that our approach achieves comparable performance to methods requiring model fine-tuning, and even surpasses them in certain cases.
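The generate–reflect–select pipeline described in the abstract can be sketched in plain Python. All function names below (`vlm_generate`, `vlm_reflect`) are hypothetical stand-ins for VLM calls; the paper's actual prompts, tools, scoring criteria, and DSPy modules are not specified here, so this is only a minimal illustration of the control flow, not the authors' implementation.

```python
# Hypothetical stand-ins for VLM calls; in practice these would invoke
# a small VLM with tool access and a structured reflection prompt.
def vlm_generate(question: str, seed: int) -> dict:
    """Sketch: produce one candidate multi-step tool-use trajectory."""
    return {"answer": f"candidate-{seed}", "steps": seed + 1}

def vlm_reflect(question: str, traj: dict) -> dict:
    """Sketch: structured reflection critiques and refines a trajectory,
    attaching a self-assessed reliability score (placeholder heuristic)."""
    score = 1.0 / traj["steps"]  # assumption: shorter trajectories score higher
    return {**traj, "score": score}

def cross_reflect(question: str, n_candidates: int = 3) -> dict:
    """Generate several trajectories, reflect on each, then perform
    cross-trajectory selection (here: argmax over reflection scores)."""
    candidates = [vlm_generate(question, s) for s in range(n_candidates)]
    reflected = [vlm_reflect(question, t) for t in candidates]
    return max(reflected, key=lambda t: t["score"])

best = cross_reflect("Which landmark is shown in the image?")
print(best["answer"])  # → candidate-0
```

The key design point the sketch captures is that selection operates across whole trajectories rather than individual steps, so the framework needs no gradient updates and remains training-free.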
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2806