Efficient Online Reinforcement Learning Fine-Tuning Should Not Retain Offline Data

ICLR 2025 Conference Submission 13523 Authors

28 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Reinforcement learning, fast fine-tuning
TL;DR: We find that previous RL fine-tuning methods fail because of Q-value divergence, and propose a new method, WSRL, that can fine-tune without retaining any offline data.
Abstract: The modern paradigm in machine learning involves pre-training models on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a static dataset, followed by rapid online RL fine-tuning using autonomous interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. This is undesirable because retaining offline data is both slow and expensive for large datasets, yet doing so has been unavoidable thus far. In this paper, we show that retaining offline data is completely unnecessary as long as we use a correctly-designed online RL approach for fine-tuning offline RL initializations. We start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden unlearning of the offline RL value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts; this unlearning erases the benefits of offline pre-training. Our approach, WSRL, mitigates this sudden unlearning by using a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy. The data collected during warmup helps ``recalibrate'' the offline Q-function to the online data, allowing us to completely discard offline data without risking destabilization of the online RL training. We show that WSRL fine-tunes without retaining any offline data, learns faster, and attains higher performance than existing algorithms irrespective of whether they retain offline data or not.
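To make the procedure described in the abstract concrete, below is a minimal sketch of the warmup-then-online fine-tuning loop. All names here (env, agent, ReplayBuffer, collect_episode, wsrl_finetune) are illustrative placeholders assumed for this sketch, not the authors' actual implementation or any specific library's API; the agent is assumed to expose a pre-trained policy and a standard online RL update step obtained from offline pre-training.

```python
# Hedged sketch of warm-started online RL fine-tuning without offline data retention.
# Every identifier below is a placeholder assumption, not the paper's released code.

from collections import deque
import random


class ReplayBuffer:
    """Simple FIFO buffer that holds ONLY online transitions (no offline data)."""

    def __init__(self, capacity: int = 1_000_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


def collect_episode(env, policy):
    """Roll out one episode with the given policy and return its transitions."""
    transitions = []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)  # assumes a gym-style env
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions


def wsrl_finetune(env, agent, warmup_episodes: int = 5,
                  online_episodes: int = 1_000, batch_size: int = 256):
    """Warm-started online fine-tuning of an offline-RL-initialized agent.

    `agent.policy(obs)` and `agent.update(batch)` are assumed interfaces:
    the policy comes from offline pre-training, and `update` performs one
    standard online RL (e.g. actor-critic) gradient step.
    """
    buffer = ReplayBuffer()

    # Warmup phase: seed the buffer with a small number of rollouts from the
    # pre-trained policy so the offline Q-function can "recalibrate" to
    # on-policy data before any gradient updates are taken.
    for _ in range(warmup_episodes):
        for transition in collect_episode(env, agent.policy):
            buffer.add(transition)

    # Online phase: ordinary off-policy RL trained purely on online data;
    # the offline dataset is never sampled or retained.
    for _ in range(online_episodes):
        for transition in collect_episode(env, agent.policy):
            buffer.add(transition)
        agent.update(buffer.sample(batch_size))

    return agent
```

The key design choice this sketch tries to convey is that the warmup rollouts, not the offline dataset, are what anchor the pre-trained Q-function at the start of fine-tuning.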
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13523