Dynamic Regret of Adversarial Linear Mixture MDPs

Published: 21 Sept 2023, Last Modified: 14 Jan 2024NeurIPS 2023 posterEveryoneRevisionsBibTeX
Keywords: dynamic regret, adversarial MDPs, linear mixture MDPs, policy optimization
TL;DR: We study the dynamic regret of adversarial linear mixture MDPs with full-information rewards and the unknown transition kernel.
Abstract: We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and the unknown transition kernel. We consider the linear mixture MDPs whose transition kernel is a linear mixture model and choose the \emph{dynamic regret} as the performance measure. Denote by $d$ the dimension of the feature mapping, $H$ the horizon, $K$ the number of episodes, $P_T$ the non-stationary measure, we propose a novel algorithm that enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves previously best-known dynamic regret for adversarial linear mixture MDP and adversarial tabular MDPs. We also establish an $\Omega\big(\sqrt{d^2 H^3 K} + \sqrt{H K (H+P_T)}\big)$ lower bound, indicating our algorithm is \emph{optimal} in $K$ and $P_T$. Furthermore, when the non-stationary measure $P_T$ is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T) + H^2 S_T^2}\big)$ dynamic regret and here $S_T$ is the expected switching number of the best base-learner. The result can be optimal under certain regimes.
Submission Number: 12066