Contextual Bandits with LLM-Derived Priors and Adaptive Calibration

20 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Contextual Bandits, Exploration–Exploitation, Online Decision-Making
Abstract: Large Language Models (LLMs) offer rich prior knowledge that can accelerate online decision-making, yet their use in contextual bandits lacks principled mechanisms for guiding exploration. We address this gap by proposing a lightweight framework that integrates LLM-derived priors with adaptive calibration in a contextual bandit setting. Our method first extracts prompt-based rewards from the LLM to provide task-specific supervision. We then construct an embedding-based estimator that quantifies uncertainty from the LLM’s representations, yielding a calibrated exploration signal. To remain robust under distribution shifts, we introduce an online contextual adapter that dynamically updates these uncertainty estimates during interaction. Experiments on LastFM and MovieLens-1M show that our method consistently outperforms both classical bandits and pure LLM-based agents, achieving higher cumulative rewards with significantly fewer LLM queries. Furthermore, we provide theoretical regret guarantees that establish improved sample efficiency compared to standard contextual bandits.
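The abstract describes the framework only at a high level. Below is a minimal sketch of how such a pipeline might be wired together, assuming a LinUCB-style base learner, per-arm ridge statistics over LLM embeddings, an LLM prior whose influence decays with observations, and a scalar calibration factor updated from prediction residuals. All names and hyperparameters (LLMPriorBandit, prior_weight, calib_lr, etc.) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class LLMPriorBandit:
    """Hypothetical sketch of a contextual bandit with LLM-derived priors
    and adaptive calibration; not the paper's actual method."""

    def __init__(self, dim, n_arms, alpha=1.0, prior_weight=1.0, calib_lr=0.05):
        self.dim = dim
        self.n_arms = n_arms
        self.alpha = alpha                # base exploration scale
        self.prior_weight = prior_weight  # trust placed in the LLM prior
        self.calib = 1.0                  # adaptive calibration multiplier on uncertainty
        self.calib_lr = calib_lr
        # Per-arm ridge-regression statistics over LLM embeddings.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.counts = np.zeros(n_arms)

    def select(self, embeddings, prior_scores):
        """embeddings: (n_arms, dim) LLM embeddings of each arm in context.
        prior_scores: (n_arms,) prompt-based reward guesses from the LLM."""
        ucb = np.empty(self.n_arms)
        for a in range(self.n_arms):
            x = embeddings[a]
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            # Blend the learned estimate with the LLM prior; the prior's
            # influence shrinks as the arm accumulates observations.
            w = self.prior_weight / (1.0 + self.counts[a])
            mean = (1 - w) * (theta @ x) + w * prior_scores[a]
            # Embedding-based uncertainty, rescaled by the calibration factor.
            width = self.calib * self.alpha * np.sqrt(x @ A_inv @ x)
            ucb[a] = mean + width
        return int(np.argmax(ucb))

    def update(self, arm, x, reward):
        A_inv = np.linalg.inv(self.A[arm])
        pred = (A_inv @ self.b[arm]) @ x
        width = self.alpha * np.sqrt(x @ A_inv @ x)
        # Online contextual adapter (assumed form): grow the calibration
        # factor when realized error exceeds the predicted uncertainty,
        # shrink it otherwise.
        err = abs(reward - pred)
        self.calib = max(self.calib + self.calib_lr * (err - self.calib * width), 1e-3)
        # Standard ridge updates.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.counts[arm] += 1
```

In use, one would call select() each round with the current context's arm embeddings and prompt-based prior scores, then feed the observed reward back through update(); because the prior only shifts the mean and the calibration only rescales the width, the agent needs no further LLM queries once embeddings and prior scores are cached, consistent with the abstract's claim of fewer LLM calls.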
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 23885