Policy-Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Published: 28 Feb 2025, Last Modified: 25 Apr 2025 · WRL@ICLR 2025 Oral · CC BY 4.0
Track: full paper
Keywords: reinforcement learning, offline RL, online fine-tuning, online RL, diffusion policies, foundation models, robotics
TL;DR: Fine-tuning multiple policy classes with Actor-Critic RL
Abstract: Recent successes in imitation learning have shown how critical it is to use expressive and multimodal policies. What would it take to replicate the success of these better policy models in Reinforcement Learning (RL)? RL training of the best-performing policy models is challenging, as most deep RL machinery is co-developed with a specific policy class and backbone, resulting in poor performance when this synergy breaks. For example, SAC utilizes a policy gradient reparameterization for Gaussian policies, but this is unstable for diffusion policies and intractable for categorical policies. In this paper, we develop an approach called **policy-agnostic RL** (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes, with offline RL and online fine-tuning methods. We build on the idea that supervised learning can replace the policy improvement step in RL, as long as it is applied on "optimized" actions. Concretely, we replace the policy improvement operator in RL with a supervised learning loss to imitate actions that maximize the critic value predictions, while staying close to the support of the data. Due to the universal nature of supervised learning, PA-RL is readily applicable to any policy model. Empirically, PA-RL enables fine-tuning continuous-action diffusion and categorical autoregressive policies, entirely via actor-critic RL. PA-RL attains state-of-the-art results in simulation, and makes it possible, for the first time, to efficiently fine-tune OpenVLA, a 7B-parameter generalist robot policy, directly on a real robot.
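
The recipe described in the abstract, replacing the policy improvement step with supervised learning on critic-optimized actions, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the actor interface (`sample`, `log_prob`), the critic call, and hyperparameters such as `num_candidates`, `action_steps`, and `action_lr` are all assumptions made for the example.

```python
# Hypothetical PA-RL-style policy improvement step (illustrative sketch only).
import torch

def parl_policy_improvement(actor, critic, states, num_candidates=8,
                            action_steps=5, action_lr=1e-2):
    """Optimize actions against the critic, then imitate them with a supervised loss."""
    with torch.no_grad():
        # 1) Sample candidate actions from the current policy, so optimized actions
        #    stay close to the support of the data / policy.
        cands = torch.stack([actor.sample(states) for _ in range(num_candidates)])  # [K, B, A]

    # 2) Global optimization: keep the candidate with the highest critic value per state.
    q_vals = torch.stack([critic(states, a) for a in cands])        # [K, B]
    best = q_vals.argmax(dim=0)                                      # [B]
    actions = cands[best, torch.arange(states.shape[0])].clone()

    # 3) Local optimization: a few gradient-ascent steps on the actions w.r.t. the critic.
    actions.requires_grad_(True)
    for _ in range(action_steps):
        q = critic(states, actions).sum()
        (grad,) = torch.autograd.grad(q, actions)
        actions = (actions + action_lr * grad).detach().requires_grad_(True)
    optimized_actions = actions.detach()

    # 4) Policy improvement via supervised learning: any policy class (diffusion,
    #    categorical autoregressive, Gaussian) can fit these target actions with its
    #    usual maximum-likelihood / imitation objective.
    return -actor.log_prob(states, optimized_actions).mean()
```

Because step 4 is an ordinary supervised objective, the same update applies unchanged whether the actor is a diffusion policy, an autoregressive token policy, or a Gaussian head; only the likelihood term changes.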
Presenter: ~Max_Sobol_Mark1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 8