Abstract: Contextual multi-armed bandits are a popular choice to model sequential decision-making. *E.g.*, in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation).
When humans design such strategies, they aim for the exploration to be *fast*, since the patient's health is at stake, and easy to *interpret* for a physician overseeing the process. However, common bandit algorithms are nothing like that: the regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds, and decision strategies are based on opaque statistical considerations. In this paper, we use an original *classification view* to meta-learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable *decision tree* that probes decisions' payoffs to classify the test bandit. The test regret of the plan in the *stochastic* and *contextual* setting scales with $O(\lambda^{-2} C_\lambda(\mathbb{M}) \log^2(MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_\lambda(\mathbb{M})$ is a novel *classification coefficient* that fundamentally links meta-learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda(\mathbb{M})$ inherently captures the complexity of the setting.
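To make the classification view concrete, here is a minimal, hypothetical Python sketch (not the paper's implementation; names such as `Node` and `run_plan` are illustrative assumptions): an exploration plan is a decision tree whose internal nodes probe an arm and branch on the observed payoff, and whose leaves commit to the arm that is optimal for the instance of $\mathbb{M}$ identified by the probes.

```python
# Hypothetical sketch of a decision-tree exploration plan (illustrative only).
from dataclasses import dataclass
from typing import Optional
import random

@dataclass
class Node:
    probe_arm: Optional[int] = None      # arm to pull at an internal node
    threshold: float = 0.5               # payoff threshold used to branch
    low: Optional["Node"] = None         # subtree if payoff <= threshold
    high: Optional["Node"] = None        # subtree if payoff > threshold
    commit_arm: Optional[int] = None     # arm to exploit at a leaf

def run_plan(tree: Node, pull, horizon: int) -> float:
    """Follow the tree to classify the test bandit, then exploit the leaf's arm.

    `pull(arm)` returns a stochastic payoff in [0, 1]; `horizon` is H.
    """
    total, t, node = 0.0, 0, tree
    while node.commit_arm is None:       # exploration phase: a few probes
        r = pull(node.probe_arm)
        total, t = total + r, t + 1
        node = node.high if r > node.threshold else node.low
    for _ in range(horizon - t):         # exploitation phase: commit
        total += pull(node.commit_arm)
    return total

# Toy collection M of two Bernoulli bandits; one probe of arm 0 separates them.
bandits = [[0.9, 0.1], [0.2, 0.8]]      # means of (arm 0, arm 1) per instance
plan = Node(probe_arm=0, threshold=0.5,
            low=Node(commit_arm=1),      # low payoff on arm 0 -> instance 1
            high=Node(commit_arm=0))     # high payoff on arm 0 -> instance 0

truth = random.choice(bandits)
reward = run_plan(plan, lambda a: float(random.random() < truth[a]), horizon=100)
print(f"cumulative reward over H=100 rounds: {reward:.0f}")
```

In this toy example a single probe of arm 0 suffices to (usually) separate the two instances, so exploration is both short and readable as an explicit sequence of tests, mirroring the fast, interpretable plans described above.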
Lay Summary: In a healthcare application we may perform various tests to assess a patient's condition and then decide on the best treatment to give. When humans design the decision strategies, they aim for the assessment to be fast, since the patient's health is at stake, and easy to interpret for a physician overseeing the process. Instead, when the problem is tackled with AI, the resulting strategy is often opaque and hard for the end user to interpret. In this paper, we provide a template to compute decision strategies that are efficient and easy to interpret, which may be employed in the described healthcare scenario or in other applications where interpretability is important. Similar to the human approach, the strategy prescribes a sequence of simple tests to gather sufficient information, and then takes optimal decisions given that information. We analyze the proposed template both theoretically, through a formal study of its efficiency, and empirically, through numerical validation in synthetic domains. We believe our work is a first step toward improving the interpretability of decision strategies obtained with AI.
Link To Code: https://github.com/muttimirco/ece
Primary Area: Theory->Online Learning and Bandits
Keywords: Meta learning bandits, Multi-armed bandits, Contextual bandits, Regret minimization, Classification complexity
Submission Number: 2888