Rank-Then-Act: Reward-Free Control from Frame-Order Progress

ICLR 2026 Conference Submission 20725 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-language models, Reward-free training, Multi-step Agents
TL;DR: RTA learns control without extrinsic rewards by ranking shuffled expert-video frames and using Spearman progress–time correlation as reward.
Abstract: We introduce Rank-Then-Act (RTA), a reward-free control framework that enables policy learning without extrinsic task rewards. Instead, RTA derives a progress-percentage signal from expert video demonstrations and evaluates it via rank correlation. Specifically, we train a Vision–Language Model (VLM) progress scorer offline with a Group Relative Policy Optimization (GRPO) objective, assigning progress percentages to shuffled frames from expert gameplay. This scorer is then frozen and used to provide feedback during reinforcement learning (RL): the agent's reward is the Spearman correlation coefficient between the scorer's predicted progress percentages for a window of recent observations and those observations' true environment timestamps, yielding a bounded, time-aligned progress signal without explicit task rewards. On the PyBoy Catrap environment, RTA enables a VLM-based agent to solve levels using only expert videos and no reward engineering. Our results suggest that training VLMs to act in games without extrinsic rewards is a promising and scalable direction for extending RL to real-world settings where reward specification is impractical or impossible.
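For concreteness, the window reward described in the abstract can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' implementation: `score_progress` is a hypothetical stand-in for the frozen VLM progress scorer, and the handling of constant predictions (where Spearman correlation is undefined) is our choice.

```python
import math
from scipy.stats import spearmanr

def rta_reward(frames, timestamps, score_progress):
    """Reward for one window of recent observations.

    frames:         observation images in the window (oldest first)
    timestamps:     true environment step indices for those frames
    score_progress: callable mapping a frame to a predicted progress
                    percentage (stand-in for the frozen VLM scorer)
    """
    preds = [score_progress(f) for f in frames]
    # Spearman rank correlation between predicted progress and time.
    rho, _ = spearmanr(preds, timestamps)
    # spearmanr returns NaN if either sequence is constant (e.g. the
    # scorer predicts identical progress for every frame); we treat
    # that case as zero progress signal.
    return 0.0 if math.isnan(rho) else float(rho)  # bounded in [-1, 1]
```

Because the reward is a rank correlation, it is invariant to monotone miscalibration of the scorer: only the ordering of the predicted progress values over the window matters, and the signal stays bounded in [-1, 1].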
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 20725