Track: Language Modeling
Keywords: Reinforcement learning, LLM, warm start, sample efficiency
TL;DR: An LLM can zero-shot a sub-optimal policy for solving an MDP, but that policy has good coverage, which can be used to warm-start classical RL algorithms.
Abstract: We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-actions visited by optimal policies, and then using an RL algorithm to explore the environment and improve upon the policy suggested by the LLM. Our algorithm, LORO, both converges to an optimal policy and achieves high sample efficiency thanks to the LLM's good starting policy.
On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4\times$ the cumulative rewards of the pure RL baseline.
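The warm-start data collection described in the abstract can be illustrated with a minimal sketch. It assumes a Gymnasium CartPole environment and uses a hypothetical `llm_policy` stand-in for the LLM query; the paper's actual prompting, parsing, and downstream RL components are not specified here.

```python
# Hypothetical sketch of LLM-based warm-start data collection for off-policy RL.
# `llm_policy` is a placeholder for querying an LLM for an action; the real
# prompt/response pipeline used by LORO is not shown here.
import gymnasium as gym


def llm_policy(observation):
    """Placeholder for an LLM-suggested action (crude heuristic stand-in)."""
    # A real implementation would prompt the LLM with a text description of
    # `observation` and parse the returned action.
    cart_pos, cart_vel, pole_angle, pole_vel = observation
    return 1 if pole_angle + 0.1 * pole_vel > 0 else 0


def collect_warm_start_data(env_id="CartPole-v1", episodes=20):
    """Roll out the LLM policy to build an off-policy dataset of transitions."""
    env = gym.make(env_id)
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = llm_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            dataset.append((obs, action, reward, next_obs, terminated))
            obs = next_obs
            done = terminated or truncated
    env.close()
    return dataset


if __name__ == "__main__":
    data = collect_warm_start_data()
    # The dataset would then seed the replay buffer of an off-policy RL
    # algorithm (e.g., DQN), which continues exploring the environment and
    # improving on the LLM's starting policy.
    print(f"Collected {len(data)} warm-start transitions")
```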
Serve As Reviewer: ~Thang_Duong1, ~Chicheng_Zhang1
Submission Number: 7