Keywords: Contextual Bandits, Exploration–Exploitation, Online Decision-Making
Abstract: Large Language Models (LLMs) offer rich prior knowledge that can accelerate online decision-making, yet their use in contextual bandits lacks principled mechanisms for guiding exploration. We address this gap by proposing a lightweight framework that integrates LLM-derived priors with adaptive calibration in a contextual bandit setting. Our method first extracts prompt-based rewards from the LLM to provide task-specific supervision. We then construct an embedding-based estimator that quantifies uncertainty from the LLM’s representations, yielding a calibrated exploration signal. To remain robust under distribution shifts, we introduce an online contextual adapter that dynamically updates these uncertainty estimates during interaction. Experiments on LastFM and MovieLens-1M show that our method consistently outperforms both classical bandits and pure LLM-based agents, achieving higher cumulative rewards with significantly fewer LLM queries. Furthermore, we provide theoretical regret guarantees that establish improved sample efficiency compared to standard contextual bandits.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 23885
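The abstract's three ingredients (an LLM-derived prior reward, an embedding-based uncertainty bonus, and an online adapter) can be illustrated with a minimal LinUCB-style sketch. This is not the authors' implementation: the class name, the `prior_weight` parameter, and the random placeholders standing in for LLM embeddings and prompt-based scores are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code): a LinUCB-style agent that
# adds a hypothetical LLM prior score to an embedding-based uncertainty bonus,
# and updates a ridge-regression model online as a stand-in for the paper's
# contextual adapter. Embeddings and prior scores below are random placeholders.
import numpy as np

class LLMPriorLinUCB:
    def __init__(self, dim, alpha=1.0, prior_weight=1.0, lam=1.0):
        self.alpha = alpha                    # exploration scale
        self.prior_weight = prior_weight      # weight on the LLM prior score
        self.A = lam * np.eye(dim)            # ridge-regularized design matrix
        self.b = np.zeros(dim)                # accumulated reward-weighted features

    def select(self, arm_embeddings, prior_scores):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = []
        for x, p in zip(arm_embeddings, prior_scores):
            mean = x @ theta                              # learned reward estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # embedding-based uncertainty
            scores.append(mean + bonus + self.prior_weight * p)
        return int(np.argmax(scores))

    def update(self, x, reward):
        # Online "adapter" step: refresh the ridge statistics with new feedback.
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy usage with synthetic rewards; real usage would query an LLM for
# embeddings and prompt-based scores instead of sampling them.
rng = np.random.default_rng(0)
dim, n_arms = 16, 5
agent = LLMPriorLinUCB(dim)
true_theta = rng.normal(size=dim)
for t in range(100):
    arm_embeddings = rng.normal(size=(n_arms, dim))   # placeholder LLM embeddings
    prior_scores = rng.uniform(size=n_arms)           # placeholder prompt-based scores
    a = agent.select(arm_embeddings, prior_scores)
    reward = arm_embeddings[a] @ true_theta + rng.normal(scale=0.1)
    agent.update(arm_embeddings[a], reward)
```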