Bad Values but Good Behavior: Learning Highly Misspecified Bandits with Function Approximation

Published: 03 Feb 2026, Last Modified: 03 Feb 2026 | AISTATS 2026 Poster | CC BY 4.0
TL;DR: We identify structural conditions under which highly misspecified bandits can still be learned optimally.
Abstract: Function approximation with parametric, feature-based reward models is widely used to enable decision-making in bandits with large action spaces. While bandit learning is well understood when the reward approximation has little or no misspecification, real-world applications often involve substantial model misspecification. We study whether optimal learning is still possible under arbitrary misspecification. We identify structural, instance-dependent conditions, determined jointly by the problem instance and the model class, under which standard algorithms such as $\epsilon$-greedy and LinUCB achieve sublinear regret, despite an arbitrarily large misspecification error in the traditional sense. These results contrast sharply with worst-case analyses that predict linear regret, and show that a broad class of instances remains robust to model error. Our findings offer a theoretical explanation for the empirical success of approximate value-based methods in complex environments.
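As a concrete illustration (not taken from the paper), the sketch below runs a standard LinUCB update on a toy bandit whose true mean rewards contain a nonlinear term, so the linear reward model is deliberately misspecified. The exploration width `alpha`, the noise scale, the sinusoidal distortion, and all other constants are illustrative assumptions, not the paper's construction.

```python
# A minimal sketch (not from the paper) of LinUCB on a misspecified linear bandit:
# features are d-dimensional, but the true mean rewards include a nonlinear
# distortion, so no parameter theta fits them exactly. All constants are assumed.

import numpy as np

rng = np.random.default_rng(0)

K, d, T = 50, 5, 5000           # number of actions, feature dimension, horizon
alpha = 1.0                      # UCB exploration width (illustrative choice)

# Fixed action features and a "true" reward that is NOT linear in them:
# a linear part plus a bounded nonlinear term (the misspecification).
X = rng.normal(size=(K, d))
theta_star = rng.normal(size=d)
true_means = X @ theta_star + 0.5 * np.sin(3.0 * X[:, 0] * X[:, 1])

# LinUCB ridge-regression statistics: design matrix A and response vector b.
A = np.eye(d)
b = np.zeros(d)

regret = 0.0
best_mean = true_means.max()

for t in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # Optimistic index: estimated mean plus confidence width for each action.
    ucb = X @ theta_hat + alpha * np.sqrt(np.einsum("kd,de,ke->k", X, A_inv, X))
    a = int(np.argmax(ucb))

    reward = true_means[a] + rng.normal(scale=0.1)   # noisy observed reward
    regret += best_mean - true_means[a]

    # Rank-one update of the ridge statistics with the chosen action's feature.
    A += np.outer(X[a], X[a])
    b += reward * X[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

On instances of this kind one can empirically compare the growth of cumulative regret with and without the nonlinear term; the paper's structural conditions characterize when such misspecified instances still admit sublinear regret.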
Submission Number: 821