Track: Language Modeling
Keywords: Reinforcement learning, LLM, warm start, sample efficiency
TL;DR: An LLM can zero-shot a sub-optimal policy for solving an MDP, but that policy has good coverage, which can be used to warm-start classical RL algorithms.
Abstract: We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-actions visited by optimal policies, and then using an RL algorithm to explore the environment and improve upon the policy suggested by the LLM. Our algorithm, LORO, both converges to an optimal policy and achieves high sample efficiency thanks to the LLM's good starting policy.
On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4\times$ the cumulative rewards of the pure RL baseline.
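The warm-start data collection described in the abstract can be illustrated with a minimal sketch. It assumes a Gymnasium CartPole environment and uses a hypothetical `llm_policy` stand-in for the LLM query; the paper's actual prompting, parsing, and downstream RL components are not specified here.

```python
# Hypothetical sketch of LLM-based warm-start data collection for off-policy RL.
# `llm_policy` is a placeholder for querying an LLM for an action; the real
# prompt/response pipeline used by LORO is not shown here.
import gymnasium as gym


def llm_policy(observation):
    """Placeholder for an LLM-suggested action (crude heuristic stand-in)."""
    # A real implementation would prompt the LLM with a text description of
    # `observation` and parse the returned action.
    cart_pos, cart_vel, pole_angle, pole_vel = observation
    return 1 if pole_angle + 0.1 * pole_vel > 0 else 0


def collect_warm_start_data(env_id="CartPole-v1", episodes=20):
    """Roll out the LLM policy to build an off-policy dataset of transitions."""
    env = gym.make(env_id)
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = llm_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            dataset.append((obs, action, reward, next_obs, terminated))
            obs = next_obs
            done = terminated or truncated
    env.close()
    return dataset


if __name__ == "__main__":
    data = collect_warm_start_data()
    # The dataset would then seed the replay buffer of an off-policy RL
    # algorithm (e.g., DQN), which continues exploring the environment and
    # improving on the LLM's starting policy.
    print(f"Collected {len(data)} warm-start transitions")
```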
Serve As Reviewer: ~Thang_Duong1, ~Chicheng_Zhang1
Submission Number: 7