Keywords: reinforcement learning, offline RL, online fine-tuning, online RL, diffusion policies
TL;DR: Fine-tuning multiple policy classes with Actor-Critic RL
Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, primarily via imitation learning. While imitation learning discards non-expert data, offline and/or online fine-tuning via reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training for a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class, resulting in poor performance when the policy class changes. For example, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive (e.g., transformer) categorical policies. As a result, current RL algorithms may perform poorly, or may not be applicable at all, when the policy class changes. To address this issue, we develop an offline RL and online fine-tuning approach called **parameterization-agnostic RL** (**PA-RL**) that can effectively train multiple policy classes with varying architectures. The key idea is that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking the action samples using the Q-function) and local optimization (i.e., taking gradient steps on an action sample) to maximize the critic over these candidates. PA-RL enables fine-tuning diffusion and autoregressive policies entirely via RL, while improving performance and sample efficiency compared to existing online RL fine-tuning methods. Notably, PA-RL allows us to successfully fine-tune diffusion policies and OpenVLA, a 7B-parameter generalist robot policy, on real robots.
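The action-optimization step described in the abstract (sampling candidate actions from the base policy, locally improving them against the critic, and re-ranking them by Q-value) can be illustrated with a minimal sketch. This is not the authors' implementation: the `policy` and `critic` callables, the hyperparameters, and the exact ordering of the global and local stages are all assumptions introduced for illustration.

```python
# Minimal illustrative sketch of PA-RL-style action optimization (assumptions noted inline).
# Assumes: policy(state, num_samples) returns candidate actions of shape (num_samples, action_dim),
# and critic(states, actions) is a differentiable Q-function. Names are hypothetical.
import torch

def optimize_actions(policy, critic, state, num_samples=32, grad_steps=5, lr=1e-2):
    # Sample a set of candidate actions from the base policy.
    actions = policy(state, num_samples)                      # (num_samples, action_dim)
    states = state.unsqueeze(0).expand(num_samples, -1)       # assumes a 1-D state vector

    # Local optimization: a few gradient-ascent steps on the candidates
    # to increase their Q-values under the critic.
    actions = actions.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(grad_steps):
        q_values = critic(states, actions)
        loss = -q_values.sum()                                # ascent on Q == descent on -Q
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Global optimization: re-rank the optimized candidates by Q-value and keep the best.
    with torch.no_grad():
        best = critic(states, actions).argmax()
    return actions[best].detach()
```

The returned "optimized" action can then serve as a supervised target for any policy class, e.g. via a diffusion policy's denoising loss or an autoregressive policy's negative log-likelihood, which is what makes the improvement step parameterization-agnostic in this sketch.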
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12384