LLMs Are In-Context Bandit Reinforcement Learners

Published: 08 Jul 2025 · Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: In-context reinforcement learning, in-context learning, contextual bandits, online learning, large language models
TL;DR: LLMs can learn in-context from online rewards, as in reinforcement learning, rather than from supervised examples alone
Abstract: Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external rewards instead of supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomenon, experimenting with challenging classification tasks and models ranging from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight the ICRL capabilities of LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
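To make the contextual-bandit ICRL setting concrete, below is a minimal illustrative sketch of an online loop in which an LLM predicts a label, receives a scalar external reward, and accumulates its own (input, prediction, reward) episodes in context. This is not the paper's implementation: `query_llm`, `stream`, `labels`, `max_context`, and `keep_prob` are hypothetical names introduced only for illustration, and the episode-retention rule is a placeholder for whatever stabilization strategy one might use.

```python
import random


def icrl_episode(stream, labels, query_llm, max_context=50, keep_prob=1.0):
    """Illustrative in-context bandit RL loop (a sketch, not the paper's method).

    stream    -- iterable of (text, true_label) pairs from a classification task
    labels    -- list of candidate labels (semantic or abstract)
    query_llm -- callable mapping a prompt string to the model's completion
    """
    episodes = []  # (input, predicted label, reward) triplets kept in context
    correct, total = 0, 0

    for text, true_label in stream:
        total += 1

        # Build the prompt from the model's own past predictions and rewards,
        # rather than from supervised (input, gold label) demonstrations.
        context = "\n".join(
            f"Input: {x}\nPrediction: {y}\nReward: {r}" for x, y, r in episodes
        )
        prompt = (
            f"{context}\n"
            f"Input: {text}\n"
            f"Choose one label from {labels}.\nPrediction:"
        )
        prediction = query_llm(prompt).strip()

        # External scalar reward: 1 if the prediction matches the gold label.
        reward = int(prediction == true_label)
        correct += reward

        # Placeholder retention rule: stochastically keep episodes in context
        # and truncate to the most recent max_context entries.
        if random.random() < keep_prob:
            episodes.append((text, prediction, reward))
            episodes = episodes[-max_context:]

    return correct / max(total, 1)
```

In this sketch the only learning signal is the reward attached to the model's own outputs; how episodes are selected and retained in context is exactly the kind of design choice the paper studies when addressing the instability of the process.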
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 439