Keywords: Large Vision-Language Model; Multimodal Reasoning; Test-Time Training; Meta-training
TL;DR: We introduce RAM-TTT, a retrieval-augmented meta test-time training framework that lets LVLMs learn from retrieved examples while avoiding overfitting, enabling stronger reasoning under distribution shifts.
Abstract: Reasoning lies at the core of Large Vision–Language Models (LVLMs). Recent Test-Time Scaling (TTS) methods enhance reasoning by allocating additional computation during inference. However, they primarily exploit the model’s internal knowledge without incorporating new information, which limits their effectiveness under distribution shifts.
While retrieval can introduce new knowledge, existing methods emphasize semantic similarity over reasoning utility, leaving LVLMs struggling to leverage the retrieved examples for complex reasoning.
To address these limitations, we propose RAM-TTT (Retrieval-Augmented Meta Test-Time Training), a retrieve–train–generate framework that unifies retrieval with meta-adaptation. RAM-TTT includes two key components: LVLM Aligned Retrieval (LAR), which selects examples for both semantic relevance and reasoning utility, and Meta Test-Time Training (Meta TTT), which casts retrieved examples as alternating support sets and meta-queries, allowing the model to ``learn how to learn'' from external information while mitigating overfitting.
Experiments show consistent gains on MathVerse (+6.4\%), LogicVista (+5.6\%), and We-Math (+8.5\%) with Qwen2-VL-7B, and strong generalization to Phi-3.5-Vision and Pixtral-12B. These results highlight RAM-TTT’s broad applicability in enabling LVLMs to acquire and internalize new information at test time for stronger reasoning under distribution shifts.
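The alternating support-set / meta-query scheme described above can be illustrated with a toy sketch. This is a hypothetical simplification, not the paper's actual procedure: the real method adapts LVLM parameters on retrieved multimodal examples, whereas here a scalar model `y = w * x` stands in, and the function names (`inner_adapt`, `meta_ttt`) and all hyperparameters are invented for illustration. The outer update is a Reptile-style step, one plausible instantiation of meta-adaptation.

```python
def inner_adapt(w, support, lr=0.02, steps=5):
    """A few gradient steps on the support set for the toy model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in support) / len(support)
        w = w - lr * grad
    return w

def meta_ttt(w, retrieved, meta_lr=0.5, rounds=3):
    """Alternate the two halves of the retrieved set as support / meta-query."""
    half = len(retrieved) // 2
    for _ in range(rounds):
        for support, query in ((retrieved[:half], retrieved[half:]),
                               (retrieved[half:], retrieved[:half])):
            w_adapted = inner_adapt(w, support)
            # Loss on the held-out meta-query half: a guard against
            # overfitting the inner loop to any single support split.
            query_loss = sum((w_adapted * x - y) ** 2
                             for x, y in query) / len(query)
            # Reptile-style outer step: move part-way toward the adapted weights.
            w = w + meta_lr * (w_adapted - w)
    return w
```

Because every retrieved example serves as both support and meta-query across the alternation, no single example dominates the adaptation, which is the intuition behind the ``learn how to learn'' framing.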
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 10131