LLM-Exp: Exploring the Policy in Reinforcement Learning with Large Language Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Reinforcement learning, large language model, policy exploration
TL;DR: This paper proposes an LLM-based method, compatible with all DQN-based RL algorithms, that enhances the efficiency of exploration in RL training.
Abstract: Policy exploration is critical in training reinforcement learning (RL) agents, with existing approaches including the $\epsilon$-greedy method in deep Q-learning, Gaussian noise in DDPG, etc. However, these approaches are built on predefined stochastic processes and are applied indiscriminately across all kinds of RL tasks, without accounting for environment-specific features that influence policy exploration. Moreover, during training, the evolution of such a stochastic process is rigid, typically incorporating only a decay of the variance. As a result, policy exploration cannot adjust flexibly to the agent's real-time learning status, which limits performance. Inspired by the analysis and reasoning capabilities of large language models (LLMs), which have achieved success in a wide range of domains, we design $\textbf{LLM-Exp}$, which uses LLMs to improve policy exploration during RL training. During RL training in a given environment, we sample a recent action-reward trajectory of the agent and prompt the LLM to analyze the agent's current policy learning status and generate a probability distribution for future policy exploration. We update this distribution periodically, deriving a stochastic process that is specialized for the particular environment and can be dynamically adjusted to adapt to the learning process. Our approach is a simple plug-in design that is compatible with DQN and any of its variants or improvements. Through extensive experiments on the Atari benchmark, we demonstrate the capability of LLM-Exp to enhance the performance of RL. Our code is open-source at https://anonymous.4open.science/r/LLM-Exp-4658 for reproducibility.
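A minimal sketch of the plug-in idea described in the abstract, under stated assumptions: exploratory actions in an otherwise standard $\epsilon$-greedy DQN loop are drawn from an LLM-provided distribution that is refreshed periodically from a recent action-reward trajectory. The helper name `query_llm_for_exploration_dist`, the update period, and the uniform fallback are illustrative assumptions, not the authors' actual prompt or API (those are in the linked repository).

```python
import random
import numpy as np

def query_llm_for_exploration_dist(trajectory, num_actions):
    """Hypothetical helper: summarize the recent (action, reward) trajectory in a
    prompt, ask an LLM for `num_actions` exploration probabilities, and parse the
    reply. Placeholder below returns a uniform distribution as a fallback."""
    # e.g. call an LLM API here and parse its reply into a probability vector
    return np.ones(num_actions) / num_actions  # placeholder / fallback

def select_action(q_values, epsilon, explore_dist):
    """Epsilon-greedy selection where the exploratory branch samples from the
    LLM-provided distribution instead of a uniform one."""
    if random.random() < epsilon:
        return int(np.random.choice(len(q_values), p=explore_dist))
    return int(np.argmax(q_values))

# Inside the DQN training loop (sketch):
#   if step % update_period == 0:
#       trajectory = recent (action, reward) pairs from the replay buffer
#       explore_dist = query_llm_for_exploration_dist(trajectory, num_actions)
#   action = select_action(q_net(state), epsilon, explore_dist)
```

Because only the exploratory branch of action selection changes, this kind of wrapper can in principle be dropped into DQN or any of its variants without touching the learning update itself.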
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9650