Estimating Optimal Policy Value in Linear Contextual Bandits Beyond Gaussianity

Published: 19 Feb 2024, Last Modified: 19 Feb 2024Accepted by TMLREveryoneRevisionsBibTeX
Abstract: In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was previously shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We subsequently introduce a practical and computationally efficient algorithm that estimates a problem-specific upper bound on $V^*$, valid for general distributions and tight for Gaussian context distributions. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound in conjunction with the estimator to derive novel and improved guarantees for several applications in bandit model selection and testing for treatment effects. We present promising experimental benefits on a semi-synthetic simulation using historical data on warfarin treatment dosage outcomes.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gergely_Neu1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1569