Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Published: 16 Jan 2024, Last Modified: 13 Apr 2024ICLR 2024 spotlightEveryoneRevisionsBibTeX
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: adversarial MDPs, policy optimization, bandit feedback
Submission Guidelines: I certify that this submission complies with the submission instructions as described on
TL;DR: We study online reinformencent learning in adversarial linear MDPs, proposing the first rate-optimal inefficent algorithm and an efficient algorithm that significantly improves prior results.
Abstract: We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, achieves a regret of $\widetilde{O}(\sqrt{K})$ without relying on simulators, where $K$ is the number of episodes. This is the first rate-optimal result in the considered setting. The second algorithm is computationally efficient and achieves a regret of $\widetilde{O}(K^{\frac{3}{4}})$ . These results significantly improve over the prior state-of-the-art: a computationally inefficient algorithm by Kong et al. (2023) with $\widetilde{O}(K^{\frac{4}{5}}+1/\lambda_{\min})$ regret, and a computationally efficient algorithm by Sherman et al. (2023b) with $\widetilde{O}(K^{\frac{6}{7}})$ regret.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: reinforcement learning
Submission Number: 7647