TL;DR: We establish that value-based online RL can be scaled predictably to larger data, larger compute, or a larger overall budget.
Abstract: Scaling data and compute is critical in modern machine learning. However, scaling also demands _predictability_: we want methods not only to perform well with more compute or data, but also to have their performance be predictable from low-compute or low-data runs, without ever running the large-scale experiment. In this paper, we show that value-based off-policy deep RL is predictable in this sense. First, we show that the data and compute required to reach a given performance level lie on a _Pareto frontier_, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher-compute regime and compute requirements into a higher-data regime. Second, we determine the optimal allocation of the total _budget_ across data and compute for a given performance level and use it to select hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between _hyperparameters_, which we use to counteract effects of overfitting and plasticity loss that are unique to RL. We validate our approach using three algorithms, SAC, BRO, and PQL, on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
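To make the first step concrete, the sketch below fits a data–compute frontier from small-scale runs and extrapolates it to a higher-compute regime. It is a minimal sketch only: the power-law functional form, the `frontier` helper, and all numeric values (UTD ratios, step counts, initial guesses) are illustrative assumptions for this illustration, not the paper's measured fit or exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale measurements: environment steps (data) needed to
# reach a fixed return threshold at several UTD ratios. Placeholder values,
# not results from the paper.
utd = np.array([1.0, 2.0, 4.0, 8.0])            # updates-to-data ratios
data = np.array([4.0e6, 2.6e6, 1.9e6, 1.5e6])   # env steps to threshold
compute = utd * data                             # gradient steps to threshold

def frontier(sigma, d_min, a, b):
    """Assumed power-law form: data needed decays toward d_min as UTD grows."""
    return d_min + a * sigma ** (-b)

params, _ = curve_fit(frontier, utd, data, p0=[1e6, 3e6, 1.0])

# Extrapolate: predict the data requirement at a higher-compute (higher-UTD)
# regime that was never run, along with the implied gradient-step count.
sigma_new = 16.0
d_pred = frontier(sigma_new, *params)
print(f"predicted env steps at UTD={sigma_new:g}: {d_pred:.3e}")
print(f"predicted gradient steps: {sigma_new * d_pred:.3e}")
```

The same fitted frontier can be read in the other direction, predicting the compute needed when more data becomes available, which is the extrapolation property the abstract describes.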
Lay Summary: A reinforcement learning agent is an AI that takes actions and makes decisions. Robots and 'thinking' LLMs are reinforcement learning agents. However, these are typically trained with expensive techniques relying on 'policy gradients'; in this paper we study scaling value-based RL, which could make such training more efficient and versatile. We study how to train large-scale value-based reinforcement learning agents by providing rules for choosing how many resources, such as data and compute, to spend. We find that these rules can be established with cheap experiments and then used to improve the performance of the larger-scale, more expensive experiment.
Primary Area: Reinforcement Learning->Deep RL
Keywords: scaling laws, online reinforcement learning, q-learning
Submission Number: 13569