Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown TransitionDownload PDF

Published: 01 Feb 2023, Last Modified: 28 Feb 2023ICLR 2023 posterReaders: Everyone
Keywords: Reinforcement learning theory, Reinforcement learning with adversarial losses, Reinforcement learning with linear function approximation
TL;DR: We make the first step to establish a provably efficient algorithm in adversarial linear mixture mdps with bandit feedback and unknown transition.
Abstract: We study reinforcement learning (RL) with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, the unknown transition probability function is a linear mixture model \citep{AyoubJSWY20,ZhouGS21,HeZG22} with a given feature mapping, and the learner only observes the losses of the experienced state-action pairs instead of the whole loss function. We propose an efficient algorithm LSUOB-REPS which achieves $\widetilde{O}(dS^2\sqrt{K}+\sqrt{HSAK})$ regret guarantee with high probability, where $d$ is the ambient dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length and $K$ is the number of episodes. Furthermore, we also prove a lower bound of order $\Omega(dH\sqrt{K}+\sqrt{HSAK})$ for this setting. To the best of our knowledge, we make the first step to establish a provably efficient algorithm with a sublinear regret guarantee in this challenging setting and solve the open problem of \citet{HeZG22}.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)
Supplementary Material: zip
18 Replies