SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning

Published: 09 May 2025 · Last Modified: 28 May 2025 · RLC 2025 · CC BY 4.0
Keywords: reinforcement learning, offline reinforcement learning, computational efficiency
TL;DR: We propose SPEQ (Offline Stabilization Phases for Efficient Q-Learning), which combines low-UTD online training with periodic offline stabilization phases, providing a significantly more computationally efficient alternative to state-of-the-art high-UTD methods.
Abstract: High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data. This structured training schedule effectively balances computational and sample efficiency, addressing the limitations of both high- and low-UTD approaches. We empirically demonstrate that SPEQ requires 40% to 99% fewer gradient updates and 27% to 78% less training time than state-of-the-art high-UTD methods while matching or surpassing their performance on the MuJoCo continuous control benchmark. Our findings highlight the potential of periodic stabilization phases as an effective alternative to conventional training schedules, paving the way for more scalable reinforcement learning solutions in real-world applications where computational resources are constrained.
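To make the training schedule described in the abstract concrete, the following is a minimal sketch of how alternating low-UTD online training with periodic high-UTD offline stabilization phases could be organized. It assumes a generic off-policy agent and replay buffer; all names (`agent`, `buffer`, `env`, `stabilization_every`, `stabilization_updates`, and the method signatures) are illustrative assumptions, not the authors' implementation.

```python
def train_speq(env, agent, buffer,
               total_env_steps=1_000_000,
               online_utd=1,                  # low UTD ratio during online interaction
               stabilization_every=50_000,    # env steps between offline stabilization phases (assumed)
               stabilization_updates=75_000): # high-UTD critic-only updates per phase (assumed)
    """Hypothetical SPEQ-style training loop: low-UTD online updates
    interleaved with periodic offline Q-function stabilization phases."""
    obs = env.reset()
    for step in range(total_env_steps):
        # --- online phase: collect data and apply a small number of updates ---
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        for _ in range(online_utd):
            agent.update(buffer.sample())          # actor + critic update (assumed interface)

        # --- offline stabilization phase: fine-tune Q-functions at high UTD
        #     on the fixed replay buffer, with no new environment interaction ---
        if (step + 1) % stabilization_every == 0:
            for _ in range(stabilization_updates):
                agent.update_critic(buffer.sample())  # critic-only update (assumed interface)
    return agent
```

Under these assumptions, the bulk of gradient updates is concentrated in the periodic offline phases rather than being performed after every environment step, which is how the method reduces total gradient updates relative to always-high-UTD baselines.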
Submission Number: 196
