Keywords: Theory, Reinforcement Learning Theory, Statistical Learning Theory, Reproducibility, Replicability
TL;DR: We design replicable algorithms for policy estimation in offline, infinite horizon, tabular, discounted Markov decision processes.
Abstract: We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions on i.i.d. samples drawn from the generator when its internal randomness is the same.
We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs.
Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$.
Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023] called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$. At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms into $\rho$-replicable ones without increasing their sample complexity.
Finally, we introduce the notion of approximate replicability, where we only require that the two output policies be close under an appropriate statistical divergence (e.g., Rényi), and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
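To make the replicability definition concrete, the following minimal Python sketch shows the two-execution test it describes. The toy MDP, the naive empirical-model planner `naive_policy_estimation`, and all constants are illustrative assumptions, not the paper's algorithms; the sketch only demonstrates what "same internal randomness, independent i.i.d. samples from the generator, identical output policy" means operationally.

```python
import numpy as np

# Illustrative sketch of the two-execution replicability test (not the paper's method).
S, A, GAMMA = 5, 3, 0.9
rng_world = np.random.default_rng(0)
P = rng_world.dirichlet(np.ones(S), size=(S, A))   # true transition kernel P(s'|s,a)
R = rng_world.uniform(size=(S, A))                 # true rewards r(s,a)

def generative_model(data_rng, n_per_pair):
    """Draw n i.i.d. next-state samples for every (s, a) pair; return empirical kernel."""
    counts = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = data_rng.choice(S, size=n_per_pair, p=P[s, a])
            counts[s, a] = np.bincount(next_states, minlength=S)
    return counts / n_per_pair

def naive_policy_estimation(P_hat, internal_seed, iters=500):
    """Plan on the empirical model and return a greedy tabular policy.

    Internal randomness (here: random tie-breaking) is driven by `internal_seed`.
    This naive procedure is *not* replicable in general; it only illustrates the test.
    """
    internal_rng = np.random.default_rng(internal_seed)
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + GAMMA * P_hat @ V                  # Q(s,a) = r(s,a) + gamma * E[V(s')]
        V = Q.max(axis=1)
    noise = 1e-12 * internal_rng.random((S, A))    # shared random tie-breaking
    return (Q + noise).argmax(axis=1)

# Two executions: independent i.i.d. data sets, but the SAME internal randomness.
policy_1 = naive_policy_estimation(generative_model(np.random.default_rng(1), 200), internal_seed=42)
policy_2 = naive_policy_estimation(generative_model(np.random.default_rng(2), 200), internal_seed=42)
print("identical policies:", np.array_equal(policy_1, policy_2))
```

A $\rho$-replicable algorithm would make the final check succeed with probability at least $1-\rho$ over the draw of the two data sets and the shared internal randomness; approximate replicability relaxes exact equality to closeness of the two policies under a statistical divergence.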
Submission Number: 1508