Large Language Model-Enhanced Multi-Armed Bandits

ICLR 2026 Conference Submission18330 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-Armed Bandits, Sequential Decision-Making
TL;DR: We propose to adopt classical MAB algorithms as the high-level framework for sequential decision-making and leverage the in-context learning capability of LLMs for reward prediction.
Abstract: Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arm to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. We therefore propose an alternative approach that combines the strengths of classical MAB algorithms and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. First, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression-oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits, where only preference feedback between pairs of arms is available and non-trivial algorithmic modifications are required. We first conduct empirical evaluations on synthetic MAB tasks, where our algorithms consistently outperform LLM-based direct arm selection. We then perform experiments on real-world text datasets, which demonstrate that in challenging tasks where the arms lack semantic meaning that the LLM can exploit, our approach delivers significantly better performance than LLM-based direct arm selection.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18330
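The submission itself contains no code, but the first algorithm described in the abstract (an LLM-based reward predictor plugged into Thompson sampling with a decaying temperature) can be illustrated with a minimal sketch. This is not the authors' implementation: `llm_predict_reward`, `pull_arm`, and the multiplicative decay schedule are assumed placeholder names and choices used only to convey the structure.

```python
def llm_enhanced_thompson_sampling(arms, llm_predict_reward, pull_arm, rounds,
                                   temp_init=1.0, temp_decay=0.99):
    """Sketch of TS-style arm selection driven by an LLM reward predictor.

    Assumed interfaces (hypothetical, not from the paper):
      - llm_predict_reward(history, arm, temperature): prompts an LLM with the
        observed (arm, reward) history in context and returns a stochastic
        scalar reward prediction for `arm`; higher temperature -> noisier
        predictions, i.e. more exploration.
      - pull_arm(arm): environment feedback, returns the observed reward.
    """
    history = []                 # (arm, observed_reward) pairs seen so far
    temperature = temp_init
    for _ in range(rounds):
        # One sampled prediction per arm; the temperature-induced randomness
        # plays the role of posterior sampling in classical Thompson sampling.
        sampled = {arm: llm_predict_reward(history, arm, temperature) for arm in arms}
        chosen = max(sampled, key=sampled.get)   # act greedily on the samples
        reward = pull_arm(chosen)
        history.append((chosen, reward))
        temperature *= temp_decay                # decay: exploration -> exploitation
    return history
```

Under this reading, the abstract's second algorithm corresponds to running the same predictor at temperature 0 inside a regression-oracle bandit method that adds exploration explicitly, rather than relying on the LLM's sampling noise.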