Keywords: Large Language Model, Exploration
TL;DR: We measure LLMs' (in)ability to make optimal decisions in bandits and evaluate a set of strategies to train LLMs to explore.
Abstract: Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration.
In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits of varying task difficulty, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference, and by distilling knowledge through in-context demonstrations and fine-tuning on synthetic data generated by these algorithms.
Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on the factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we provide empirical measurements of the convergence rates of the different exploration strategies we introduce.
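Illustrative sketch (an assumption, not the authors' actual pipeline): since the abstract describes distilling optimal exploration algorithms into LLMs via synthetic demonstrations, the snippet below shows one way a UCB1 trajectory on a context-free Bernoulli bandit could be generated and rendered as text for in-context demonstrations or fine-tuning data. The function name `ucb1_trajectory` and parameters `arm_means` and `horizon` are hypothetical.

```python
# Sketch: generate a synthetic exploration trajectory from UCB1 on a
# context-free Bernoulli bandit, then render it as text for an LLM prompt.
import math
import random

def ucb1_trajectory(arm_means, horizon, seed=0):
    """Run UCB1 and return a list of (arm, reward) pairs."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    trajectory = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialize estimates
        else:
            # UCB1 index: empirical mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        trajectory.append((arm, reward))
    return trajectory

# Render the trajectory as text, e.g. as an in-context demonstration.
history = ucb1_trajectory(arm_means=[0.2, 0.5, 0.8], horizon=20)
prompt = "\n".join(
    f"t={t}: pulled arm {a}, observed reward {r:.0f}"
    for t, (a, r) in enumerate(history, 1)
)
print(prompt)
```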
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9499