TL;DR: We measure LLMs' (in)ability to make optimal decisions in bandits and evaluate a set of strategies to train LLMs to explore.
Abstract: Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference, and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
Lay Summary: LLMs are in-context RL learners, but not great ones, because they can't explore well. How do we teach LLMs to explore better? 🤔
Solution: supervised fine-tuning on full exploration trajectories. Exploration is crucial for in-context RL (ICRL): LLMs need to explore in context, without retraining. In the paper, we tried the following (a sketch of the summarized-history format follows the list):
- Summarized history
- Few-shot demonstrations
- Inference-time algorithm-guided support
- Full exploration trajectory fine-tuning
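To make the first item concrete, here is a minimal sketch of how a bandit interaction history could be rendered as a summarized-history prompt (per-arm counts and mean rewards rather than the raw step-by-step log). The arm names and prompt wording are illustrative; the paper's exact textual format may differ.

```python
# Illustrative sketch: turning a bandit history into a "summarized history"
# prompt. Arm names and phrasing are hypothetical, not the paper's format.
from collections import defaultdict

def summarize_history(history, arm_names):
    """history: list of (arm_index, reward) tuples observed so far."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward

    lines = []
    for i, name in enumerate(arm_names):
        if counts[i] == 0:
            lines.append(f"{name}: not tried yet")
        else:
            mean = totals[i] / counts[i]
            lines.append(f"{name}: pulled {counts[i]} times, average reward {mean:.2f}")
    return (
        "You are choosing among the following arms.\n"
        + "\n".join(lines)
        + "\nWhich arm do you pull next? Answer with the arm name only."
    )

# Example usage
print(summarize_history([(0, 1.0), (0, 0.0), (1, 1.0)], ["A", "B", "C"]))
```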
With full exploration trajectory fine-tuning (which we dub Oracle Behavior Fine-tuning, or OFT), we first let a smaller model, Gemini Flash (similar to GPT-4o-mini), surpass Gemini Pro (similar to GPT-4o) on one task: 65.6% vs. 60.0%.
Then, on another task (a contextual bandit), we nearly closed the optimality gap between the LLM and the theoretically optimal algorithm, LinUCB (with no model misspecification).
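For context, a bare-bones textbook LinUCB (disjoint model, one linear estimator per arm) looks roughly like the sketch below. The feature dimension `d` and exploration weight `alpha` are placeholders; this is a generic reference implementation, not the paper's exact baseline code.

```python
# Minimal textbook LinUCB (disjoint per-arm model), shown only to make the
# "theoretically optimal algorithm" baseline concrete. Hyperparameters are
# placeholders, not the paper's settings.
import numpy as np

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """x: context feature vector of dimension d; returns an arm index."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```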
There are many caveats about how to represent the exploration history in text, how to pick training data, and when to introduce inference-time support; read the paper for more details! The two tasks we built are inspired by a one-step RL problem: the bandit. This setup has long been used to rigorously study algorithms like UCB; now it can also be used to study LLMs' ability to explore!
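As a rough illustration of the OFT data recipe in the context-free setting: run an oracle policy such as UCB1 on a synthetic bandit and turn each (history, oracle action) pair into a supervised fine-tuning example. The reward probabilities, prompt wording, and example fields below are assumptions for the sketch, not the paper's exact pipeline.

```python
# Sketch: generating oracle (UCB1) trajectories on a synthetic Bernoulli bandit
# and serializing them into (prompt, target) pairs for supervised fine-tuning.
# Bandit parameters, prompt wording, and fields are illustrative only.
import math
import random

def ucb1_action(counts, means, t):
    # Pull every arm once, then pick the arm with the largest UCB1 index.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))

def generate_oft_examples(arm_probs, horizon, seed=0):
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, means, history, examples = [0] * k, [0.0] * k, [], []
    for t in range(1, horizon + 1):
        action = ucb1_action(counts, means, t)
        prompt = ("History: "
                  + "; ".join(f"arm_{a} -> reward {r:g}" for a, r in history)
                  + ". Which arm do you pull next?")
        examples.append({"prompt": prompt, "target": f"arm_{action}"})
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        counts[action] += 1
        means[action] += (reward - means[action]) / counts[action]
        history.append((action, reward))
    return examples

# e.g. dataset = generate_oft_examples([0.2, 0.5, 0.8], horizon=50)
```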
Link To Code: https://github.com/allenanie/EVOLvE
Primary Area: Deep Learning->Large Language Models
Keywords: Exploration, In-Context Reinforcement Learning, Bandit
Submission Number: 8531