Keywords: Reinforcement Learning, Large Language Model
TL;DR: We propose dynamic reinforcement learning, which takes a step toward achieving the scalability of reinforcement learning for training a large language model by itself.
Abstract: Scaling laws constitute one of the fundamental principles of large language models (LLMs), revealing that model performance consistently improves as training data increases. In this paper, we propose dynamic reinforcement learning (RL), which takes a step toward achieving the scalability of RL for training LLMs by themselves. Dynamic RL operates by sampling data from the dynamically changing LLM itself, estimating golden answers based on the model’s own outputs, and then using this self-generated data to optimize the model. Its dynamic characteristic allows the data distribution to continuously adapt to the evolving model, leading to better alignment between training data and model capabilities. Unlike conventional approaches, dynamic RL requires neither static, pre-collected datasets nor external verifiers for correctness; everything is done by the large language model itself. Experimental results demonstrate that dynamic RL can continually improve model performance over a thousand training steps and achieve results comparable to models trained on large-scale external datasets.
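To make the training loop described in the abstract concrete, below is a minimal sketch of one dynamic RL step. The abstract does not specify how golden answers are estimated or which policy-gradient update is used, so this sketch assumes majority voting over the model's own samples and a toy REINFORCE-style update; `ToyModel`, `estimate_golden_answer`, and `dynamic_rl_step` are illustrative names, not the paper's implementation.

```python
# Minimal sketch of one dynamic RL step, under the assumptions stated above.
import random
from collections import Counter


class ToyModel:
    """Stand-in for an LLM policy; a real implementation would wrap an
    actual model and optimizer."""

    def __init__(self):
        self.bias = 0.5  # toy scalar parameter standing in for model weights

    def generate(self, prompt: str) -> str:
        # Toy stochastic "answer"; a real model would decode a response here.
        return "A" if random.random() < self.bias else "B"

    def update(self, answer: str, reward: float, lr: float = 0.01):
        # REINFORCE-style toy update: nudge the policy toward rewarded answers.
        direction = 1.0 if answer == "A" else -1.0
        self.bias = min(max(self.bias + lr * reward * direction, 0.0), 1.0)


def estimate_golden_answer(samples: list[str]) -> str:
    # Assumed estimator: majority vote over the model's own outputs
    # (no external verifier).
    return Counter(samples).most_common(1)[0][0]


def dynamic_rl_step(model: ToyModel, prompt: str, n_samples: int = 8):
    # 1. Sample data from the current (dynamically changing) model.
    samples = [model.generate(prompt) for _ in range(n_samples)]
    # 2. Estimate the golden answer from the model's own outputs.
    golden = estimate_golden_answer(samples)
    # 3. Optimize the model on this self-generated data: reward agreement
    #    with the estimated golden answer.
    for answer in samples:
        reward = 1.0 if answer == golden else -1.0
        model.update(answer, reward)


model = ToyModel()
for step in range(1000):  # the training distribution adapts as the model evolves
    dynamic_rl_step(model, prompt="toy question")
print(f"final policy parameter: {model.bias:.2f}")
```

Because each step re-samples from the updated model, the self-generated training data tracks the model's current capabilities, which is the dynamic property the abstract highlights.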
Primary Area: reinforcement learning
Submission Number: 17904