Keywords: Bounded regret, Linear contextual bandit, Recurring arrivals, Counterfactual, Upper confidence bound
TL;DR: We propose a recurring linear contextual bandit problem and a policy that achieves $O(1)$ regret for each agent.
Abstract: In typical linear contextual bandit settings, a regret lower bound of order $\Omega(\sqrt{T})$ has been established, so regret necessarily grows without bound as $T \to \infty$. Here we present a linear contextual bandit setting with recurring arrivals of a set of agents in which bounded, i.e., $O(1)$, expected regret can be achieved for each agent. We provide a novel Counterfactual UCB (CFUCB) policy under which agents benefit from the experiences of other agents. We show that sharing information is a subgame perfect Nash equilibrium for the agents with respect to the order of the regret, and that it results in each agent realizing bounded regret. Personalized recommender systems and adaptive experimentation are two important applications.
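The abstract does not spell out the mechanics of CFUCB, so the following Python sketch is only a hypothetical illustration of the information-sharing idea it describes: recurring agents pool their observations into shared LinUCB-style statistics, so each agent's confidence set shrinks with the combined experience of all agents. The class name `SharedLinUCB`, its parameters `alpha` and `lam`, and the single shared statistics pool are assumptions for illustration, not the paper's counterfactual construction.

```python
import numpy as np

class SharedLinUCB:
    """Hypothetical sketch, not the paper's CFUCB policy: a LinUCB-style
    learner whose sufficient statistics are pooled across recurring agents,
    so every agent benefits from the observations of all others."""

    def __init__(self, dim: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha              # width of the confidence bonus
        self.A = lam * np.eye(dim)      # shared regularized Gram matrix
        self.b = np.zeros(dim)          # shared reward-weighted feature sum

    def select(self, contexts: np.ndarray) -> int:
        """Pick an arm for one arriving agent; contexts is (n_arms, dim)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b          # ridge estimate of the parameter
        bonus = np.sqrt(np.einsum("ad,dk,ak->a", contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, x: np.ndarray, reward: float) -> None:
        """Fold one agent's observation into the shared statistics."""
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy run: recurring agents arrive in round-robin order and share one pool.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)         # unknown true parameter
policy = SharedLinUCB(dim=5, alpha=1.0)
for t in range(300):
    contexts = rng.normal(size=(4, 5))  # the arriving agent's arms at round t
    a = policy.select(contexts)
    r = contexts[a] @ theta_star + 0.1 * rng.normal()
    policy.update(contexts[a], r)
```

Because every update enriches the same `(A, b)` pair, each agent's estimation error at its next arrival depends on the total number of observations across all agents, which is the mechanism the abstract credits for the improvement over single-agent learning.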
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)