No-regret Learning with Revealed Transitions in Adversarial Markov Decision Processes

23 Sept 2024 (modified: 20 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Adversarial Markov Decision Processes, Reinforcement Learning, Online Learning
Abstract: When learning in Adversarial Markov Decision Processes (MDPs), agents must deal with a sequence of arbitrarily chosen transition models and losses. In this paper, we consider the setting in which the transition model chosen by the adversary is revealed at the end of each episode. We propose the notion of a smoothed MDP, whose transition model aggregates, through a generic function $f_t$, the transition models experienced so far. Accordingly, we define the concept of smoothed regret, and we devise Smoothed Online Mirror Descent (SOMD), an enhanced version of OMD that leverages a novel regularization term to learn effectively in this setting. For specific choices of the aggregation function $f_t$ defining the smoothed MDPs, we retrieve, under full feedback, a regret bound of order $\widetilde{\mathcal O}(L^{3/2}\sqrt{TL}+L\overline{C}_f^{\mathsf{P}})$, where $T$ is the number of episodes, $L$ is the horizon of an episode, and $\overline{C}_f^{\mathsf{P}}$ is a novel index of the degree of maliciousness of the adversarially chosen transitions. Under bandit feedback on the losses, we obtain a bound of order $\widetilde{\mathcal O}(L^{3/2}\sqrt{XAT}+L\overline{C}_f^{\mathsf{P}})$, where $X$ and $A$ denote the number of states and actions, using a simple importance-weighted estimator of the losses.
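To make these two ingredients concrete, here is a minimal illustration (an assumption-laden sketch, not necessarily the choices made in the paper): one possible aggregation $f_t$ is the uniform average of the revealed transition kernels, and the bandit-feedback estimator can be the standard importance-weighted one,
$$
\overline{P}_t(x' \mid x,a) \;=\; f_t\big(P_1,\dots,P_t\big)(x' \mid x,a) \;=\; \frac{1}{t}\sum_{s=1}^{t} P_s(x' \mid x,a),
\qquad
\widehat{\ell}_t(x,a) \;=\; \frac{\ell_t(x,a)\,\mathbb{1}\{(x,a)\ \text{visited in episode } t\}}{q_t(x,a)},
$$
where $P_s$ is the transition model revealed after episode $s$ and $q_t(x,a)$ is the probability of visiting the state-action pair $(x,a)$ under the current policy and the smoothed transition model; both the uniform-average form of $f_t$ and the symbols $\overline{P}_t$, $q_t$ are illustrative notation not taken from the abstract.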
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3117