Keywords: Test-Time Reinforcement Learning, Vision-Language Models, Single-Sample Optimization, Visual Reasoning, Test-Time Adaptation, Majority Voting, Pseudo-labeling, Segmentation, Object Counting, Self-Supervised Learning, Chain-of-Thought, Group Relative Policy Optimization
TL;DR: We introduce Vision Reasoning Test-Time Reinforcement Learning (VR-TTRL), a test-time reinforcement learning framework that adapts a vision-language model to a visual reasoning task using only its own predictions on a single unlabeled sample.
Abstract: While Test-Time Reinforcement Learning (TTRL) has shown promise for adapting language models without ground-truth answers, its application to vision-language tasks remains unexplored. Moreover, existing TTRL methods require multiple samples or known answers for optimization, limiting their practical applicability. We introduce Vision Reasoning Test-Time Reinforcement Learning (VR-TTRL), to our knowledge the first framework to apply TTRL to vision-language models for visual reasoning tasks, enabling adaptation from a single unlabeled sample without any ground-truth answers. Our approach leverages majority voting across model rollouts to generate pseudo-labels for self-supervision, combining the structured reasoning capabilities of vision-language models with the adaptive power of test-time reinforcement learning. Through experiments on segmentation and counting tasks, we demonstrate that VR-TTRL enables effective model adaptation using only a single unlabeled sample, achieving performance improvements over state-of-the-art baselines. This work suggests promising directions for further improving vision task performance through self-supervised adaptation and for enabling models to better leverage their pre-trained capabilities during inference.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12579
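The abstract's core mechanism is majority voting over rollouts to produce a pseudo-label, which then serves as a reward signal for test-time reinforcement learning. The following is a minimal, self-contained sketch of that step, not the authors' code: rollouts are mocked as final answers for a single counting sample rather than sampled from a real vision-language model, and all function names (e.g., majority_vote_pseudo_label) are hypothetical. Advantages are group-normalized in the spirit of Group Relative Policy Optimization (GRPO), as listed in the keywords.

```python
# Sketch of majority-voting pseudo-labeling for single-sample TTRL.
# Assumption: each rollout is a sampled chain of thought ending in a
# parsed final answer; here we use the final answers directly.
from collections import Counter
from statistics import mean, pstdev

def majority_vote_pseudo_label(answers):
    """Return the most frequent answer across rollouts as the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]

def grpo_style_advantages(rewards):
    """Group-relative advantages: center and scale rewards within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Hypothetical rollouts for one unlabeled counting sample:
# eight sampled generations, each yielding a final object count.
rollout_answers = [7, 7, 8, 7, 6, 7, 8, 7]

pseudo_label = majority_vote_pseudo_label(rollout_answers)  # -> 7
rewards = [1.0 if a == pseudo_label else 0.0 for a in rollout_answers]
advantages = grpo_style_advantages(rewards)

print(f"pseudo-label: {pseudo_label}")
print(f"rewards:      {rewards}")
print(f"advantages:   {[round(a, 2) for a in advantages]}")
# A policy-gradient update (e.g., GRPO) would then reinforce rollouts with
# positive advantage, adapting the model to this one sample without labels.
```

In this sketch, rollouts agreeing with the majority answer receive positive advantage and dissenting rollouts receive negative advantage, which is how self-supervised adaptation can proceed from a single sample with no ground truth.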