Sample-efficient Reinforcement Learning by Warm-starting with LLMs

ICLR 2026 Conference Submission 13766 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement learning, LLM, warm start, sample efficiency, coverage
TL;DR: An LLM can only zero-shot a sub-optimal policy for solving an MDP, but that policy has good coverage, which can be used to warm-start classical RL algorithms.
Abstract: We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in Markov Decision Processes (MDPs). Specifically, we leverage the in-context decision-making capability of LLMs to generate an "offline" dataset that sufficiently covers the state-action pairs visited by a good policy, and then use an off-the-shelf RL algorithm, in a black-box manner, to further explore the environment and fine-tune the policy. Our algorithm, LORO\footnote{The code of our experiments can be viewed at \url{https://anonymous.4open.science/r/LlamaGym-551D}}, both converges to an optimal policy and achieves high sample efficiency thanks to the good data coverage of the LLM-collected dataset. On multiple OpenAI Gym environments, such as CartPole and Pendulum, and under the same environment-interaction budget, we empirically demonstrate that LORO outperforms baselines such as pure LLM-based policies, pure RL, and a naive combination of the two.
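
The abstract does not give implementation details, so the following is only a minimal sketch of the two-phase pipeline it describes: an LLM-driven policy collects an offline dataset, which is then handed to an off-the-shelf RL algorithm as a warm start. The names `llm_act` and `collect_warm_start_data` are hypothetical, and the LLM query is stubbed with a random action so the snippet runs end to end without an API key.

```python
# Minimal sketch (not the authors' code) of LLM-collected warm-start data for RL.
import gymnasium as gym


def llm_act(obs, action_space):
    """Placeholder for an LLM prompted in-context with a textual state description.
    Stubbed here with a random valid action so the sketch is runnable."""
    return action_space.sample()


def collect_warm_start_data(env_id="CartPole-v1", episodes=20):
    """Phase 1: roll out the (sub-optimal but well-covering) LLM policy."""
    env = gym.make(env_id)
    dataset = []  # transitions (obs, action, reward, next_obs, done)
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = llm_act(obs, env.action_space)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            dataset.append((obs, action, reward, next_obs, done))
            obs = next_obs
    env.close()
    return dataset


if __name__ == "__main__":
    offline_data = collect_warm_start_data()
    print(f"collected {len(offline_data)} transitions for warm-starting")
    # Phase 2 (not shown): pre-fill the replay buffer of an off-the-shelf RL
    # algorithm (e.g., DQN) with `offline_data`, then continue online
    # fine-tuning under the remaining interaction budget.
```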
Primary Area: reinforcement learning
Submission Number: 13766