How Can LLM Guide RL? A Value-Based Approach

Shenao Zhang; Sirui Zheng; Shuqi Ke; Zhihan Liu; Wanxin Jin; Jianbo Yuan; Yingxiang Yang; Hongxia Yang; Zhaoran Wang

How Can LLM Guide RL? A Value-Based Approach

Shenao Zhang, Sirui Zheng, Shuqi Ke, Zhihan Liu, Wanxin Jin, Jianbo Yuan, Yingxiang Yang, Hongxia Yang, Zhaoran Wang

27 Sept 2024 (modified: 27 Nov 2024)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Model, Reinforcement Learning, Agent

Abstract: Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named $\mathtt{LINVIT}$ that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm $\mathtt{SLINVIT}$ that simplifies the construction of the value function and employs sub-goals to reduce the search complexity. Our experiments across three interactive environments---ALFWorld, InterCode, and BlocksWorld---demonstrate that the proposed method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency.

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11059

Loading