Keywords: Meta reinforcement learning, Test-time regret
Abstract: Meta reinforcement learning posits a distribution over a set of tasks on which the agent can \emph{train} at will, and then asks it to learn an optimal policy for any \emph{test} task efficiently. In this paper, we consider a \emph{finite} set of tasks modeled through Markov decision processes with various dynamics. We assume the agent has completed a long training phase, from which the set of tasks is perfectly recovered, and we focus on \emph{regret minimization} against the optimal policy in the unknown test task. Under a separation condition stating that, for every pair of tasks, some state-action pair distinguishes one from the other, \citet{chen2021understanding} show that $O(M^2 \log(H))$ regret can be achieved, where $M$ and $H$ are the number of tasks in the set and the number of test episodes, respectively. In our main contribution, we demonstrate that this rate is nearly optimal by developing a novel \emph{lower bound} for test-time regret minimization under separation, showing that a linear dependence on $M$ is unavoidable. Our paper provides a new understanding of the statistical barriers to deploying a meta-trained agent.
Submission Number: 16