Provably Efficient Linear Bandits with Instantaneous Constraints in Non-Convex Feature Spaces

26 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | Readers: Everyone | CC BY 4.0
Keywords: Linear Bandits, Non-convex feature spaces, Instantaneous hard constraints, Safety, UCB
Abstract: In linear stochastic bandits, tasks with instantaneous hard constraints present significant challenges, particularly when the feature space is non-convex or discrete. This setting is especially relevant in applications such as financial management, recommendation systems, and medical treatment selection, where safety constraints appear in non-convex forms or where decisions must be made within non-convex and discrete sets. In these systems, bandit methods rely on feature functions to extract critical features. However, in contrast to the star-convexity assumption commonly made in the literature, these feature functions often induce non-convex and more complex feature spaces. In this paper, we investigate linear bandits and introduce a method that operates effectively in a non-convex feature space while satisfying instantaneous hard constraints at each time step. We show that, with high probability, our method achieves a regret of $\tilde{\mathcal{O}}\big( d (1+\frac{\tau}{\epsilon \iota}) \sqrt{T}\big)$ and meets the instantaneous hard constraints, where $d$ is the feature space dimension, $T$ is the total number of rounds, and $\tau$ is a safety-related parameter. The constant parameters $\epsilon$ and $\iota$ are related to our localized assumptions around the origin and the optimal point. In contrast, standard safe linear bandit algorithms that rely on the star-convexity assumption often incur linear regret. Our approach also handles discrete action spaces while maintaining a comparable regret bound. Moreover, we establish an information-theoretic lower bound on the regret of $\Omega \left( \max\{ d \sqrt{T}, \frac{1}{\epsilon \iota^2} \} \right)$ for $T \geq \frac{32 e}{\epsilon \iota^2}$, emphasizing the critical role of $\epsilon$ and $\iota$ in the regret upper bound. Lastly, we provide numerical results to validate our theoretical findings.
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 8177
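
To get a feel for how the two bounds stated in the abstract scale with $\epsilon$ and $\iota$, the following is a minimal sketch that evaluates the expressions inside $\tilde{\mathcal{O}}(\cdot)$ and $\Omega(\cdot)$ directly. It drops all constants and logarithmic factors, so the outputs are scalings rather than actual regret values. The function names and the numeric inputs are illustrative assumptions, not taken from the paper.

```python
import math


def regret_upper_scale(d, T, tau, eps, iota):
    """Expression inside the O-tilde bound: d * (1 + tau / (eps * iota)) * sqrt(T).

    Constants and log factors are dropped (assumption for illustration).
    """
    return d * (1.0 + tau / (eps * iota)) * math.sqrt(T)


def regret_lower_scale(d, T, eps, iota):
    """Expression inside the Omega bound: max(d * sqrt(T), 1 / (eps * iota**2))."""
    return max(d * math.sqrt(T), 1.0 / (eps * iota ** 2))


def min_horizon_for_lower_bound(eps, iota):
    """Smallest T for which the abstract states the lower bound: 32 e / (eps * iota**2)."""
    return 32.0 * math.e / (eps * iota ** 2)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to show the scaling.
    d, tau, eps, iota = 5, 1.0, 0.5, 0.5
    t_min = min_horizon_for_lower_bound(eps, iota)
    for T in (10**3, 10**4, 10**5):
        upper = regret_upper_scale(d, T, tau, eps, iota)
        if T >= t_min:
            lower = regret_lower_scale(d, T, eps, iota)
            print(f"T={T}: upper scale {upper:.1f}, lower scale {lower:.1f}")
        else:
            print(f"T={T}: upper scale {upper:.1f}, lower bound not stated for this T")
```

Shrinking `eps` or `iota` inflates the factor $\tau / (\epsilon \iota)$ in the upper expression and the term $1 / (\epsilon \iota^2)$ in the lower one, which is the dependence the abstract highlights.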