Keywords: Large Vision-Language Model; Multimodal Reasoning; Test-Time Training; Meta-training
TL;DR: We introduce RAM-TTT, a retrieval-augmented meta test-time training framework that lets LVLMs learn from retrieved examples while avoiding overfitting, enabling stronger reasoning under distribution shifts.
Abstract: Reasoning lies at the core of Large Vision–Language Models (LVLMs). Recent Test-Time Scaling (TTS) methods enhance reasoning by allocating additional computation during inference. However, they primarily exploit the model’s internal knowledge without incorporating new information, which limits their effectiveness under distribution shifts.
While retrieval can introduce new knowledge, existing methods emphasize semantic similarity over reasoning utility, leaving LVLMs struggling to leverage the retrieved examples for complex reasoning.
To address these limitations, we propose RAM-TTT (Retrieval-Augmented Meta Test-Time Training), a retrieve–train–generate framework that unifies retrieval with meta-adaptation. RAM-TTT includes two key components: LVLM Aligned Retrieval (LAR), which selects examples for both semantic relevance and reasoning utility, and Meta Test-Time Training (Meta TTT), which casts retrieved examples as alternating support sets and meta-queries, allowing the model to ``learn how to learn'' from external information while mitigating overfitting.
Experiments show consistent gains on MathVerse (+6.4\%), LogicVista (+5.6\%), and We-Math (+8.5\%) with Qwen2-VL-7B, and strong generalization to Phi-3.5-Vision and Pixtral-12B. These results highlight RAM-TTT’s broad applicability in enabling LVLMs to acquire and internalize new information at test time for stronger reasoning under distribution shifts.
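The alternating support-set / meta-query scheme described above can be illustrated with a toy sketch. This is a hypothetical simplification, not the paper's actual procedure: the real method adapts LVLM parameters on retrieved multimodal examples, whereas here a scalar model `y = w * x` stands in, and the function names (`inner_adapt`, `meta_ttt`) and all hyperparameters are invented for illustration. The outer update is a Reptile-style step, one plausible instantiation of meta-adaptation.

```python
def inner_adapt(w, support, lr=0.02, steps=5):
    """A few gradient steps on the support set for the toy model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in support) / len(support)
        w = w - lr * grad
    return w

def meta_ttt(w, retrieved, meta_lr=0.5, rounds=3):
    """Alternate the two halves of the retrieved set as support / meta-query."""
    half = len(retrieved) // 2
    for _ in range(rounds):
        for support, query in ((retrieved[:half], retrieved[half:]),
                               (retrieved[half:], retrieved[:half])):
            w_adapted = inner_adapt(w, support)
            # Loss on the held-out meta-query half: a guard against
            # overfitting the inner loop to any single support split.
            query_loss = sum((w_adapted * x - y) ** 2
                             for x, y in query) / len(query)
            # Reptile-style outer step: move part-way toward the adapted weights.
            w = w + meta_lr * (w_adapted - w)
    return w
```

Because every retrieved example serves as both support and meta-query across the alternation, no single example dominates the adaptation, which is the intuition behind the ``learn how to learn'' framing.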
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 10131