TTRV: Test-Time Reinforcement Learning for Vision Language Models

08 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: VLMs, LLMs, Test-time-adaptation, Reinforcement Learning
TL;DR: We introduce the first Test-time Reinforcement Learning Framework for Vision Language Models.
Abstract: Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision–language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's outputs when running inference on each test sample multiple times. We further propose to control the diversity of the model's outputs by simultaneously rewarding low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to $52.4$% and $29.8$%, respectively, and average boosts of $24.6$% and $10.0$% across $16$ datasets. Remarkably, on image recognition, TTRV applied to Intern-VL-8B surpasses GPT-4o by an average of $2.3$% over $8$ benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to $5.5$% in recognition tasks.
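
To make the reward design concrete, below is a minimal, hypothetical Python sketch of how frequency-based rewards and a low-entropy bonus could be computed over a group of answers sampled for a single unlabeled test input. The function name `frequency_rewards`, the `entropy_weight` parameter, and the normalization choices are illustrative assumptions, not the authors' implementation; in GRPO, such per-sample rewards would then be normalized within the group to form advantages.

```python
# Hypothetical sketch of TTRV-style reward shaping: reward each sampled answer by its
# empirical frequency within the group, and add a shared bonus when the group's
# answer distribution has low entropy. Names are illustrative, not from the paper.
from collections import Counter
import math

def frequency_rewards(answers, entropy_weight=0.5):
    """Return one reward per sampled answer for a single unlabeled test input."""
    counts = Counter(answers)
    n = len(answers)
    # Empirical distribution over the distinct answers produced by the base model.
    probs = {a: c / n for a, c in counts.items()}
    # Entropy of the empirical distribution; lower entropy means stronger agreement.
    entropy = -sum(p * math.log(p) for p in probs.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    entropy_bonus = 1.0 - entropy / max_entropy  # in [0, 1], high when outputs agree
    # Each sample is rewarded by how frequent its answer is, plus the shared bonus.
    return [probs[a] + entropy_weight * entropy_bonus for a in answers]

# Example: 5 rollouts on one test image, no label required.
rollouts = ["cat", "cat", "cat", "dog", "cat"]
print(frequency_rewards(rollouts))
```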
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3120