Keywords: Generalization, Statistical Testing, Verification
TL;DR: We place probabilistic bounds on the full performance distribution of a learned policy deployed in a new environment while minimizing the number of policy rollouts needed to construct the bound.
Abstract: We present tools to bound the generalization performance of stochastic imitation learning policies deployed in novel environments. As is common in settings of robot learning from demonstrations, we assume no access to an observation or state transition model; we only have access to a small number of experimental rollouts of the policy and a performance score that measures the policy's success on those rollouts. Given this finite sample of performance scores, we propose a worst-case bound on the full probability distribution over the performance score. We give bounds for two kinds of performance metrics: binary task success and continuous-valued total reward. Our bounds hold at a user-specified confidence level and tightness, and are constructed from as few rollouts in the new environment as possible. To accomplish this, we build on classical methods for constructing confidence sets without access to an underlying probability model. By defining a partial order over cumulative distributions of the performance score, we obtain confidence bounds on the full cumulative distribution of performance (from which one can derive expected-value bounds, quantile bounds, and numerous other bounds). We apply our approach to assess the generalization of a diffusion policy for visuomotor manipulation, where we find (perhaps counter-intuitively) that the policy performs strongly under a visually large domain shift but weakly under a smaller shift.
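Illustrative sketch (not the paper's exact construction): one standard way to obtain distribution-free confidence bounds of the kind described in the abstract is a Clopper-Pearson interval for binary task success and a Dvoretzky-Kiefer-Wolfowitz (DKW) band on the CDF of a continuous total reward. Function names, parameters, and the example data below are ours, for illustration only.

```python
# Hypothetical illustration of distribution-free bounds from a finite sample of
# rollout scores; this is NOT the authors' specific method, only a classical baseline.
import numpy as np
from scipy.stats import beta


def clopper_pearson(successes, n, alpha=0.05):
    """Exact (1 - alpha) confidence interval for a binary success rate."""
    lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    upper = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lower, upper


def dkw_cdf_band(scores, alpha=0.05):
    """Two-sided (1 - alpha) confidence band on the CDF of a performance score,
    via the DKW inequality (no model of the environment or dynamics needed)."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    ecdf = np.arange(1, n + 1) / n                 # empirical CDF at the sorted scores
    lower = np.clip(ecdf - eps, 0.0, 1.0)          # pointwise lower envelope on the CDF
    upper = np.clip(ecdf + eps, 0.0, 1.0)          # pointwise upper envelope on the CDF
    return scores, lower, upper


# Example with 50 hypothetical rollouts in a new environment.
rng = np.random.default_rng(0)
rewards = rng.normal(loc=0.7, scale=0.2, size=50)  # stand-in total-reward scores
print(clopper_pearson(successes=41, n=50))         # bound on the binary success rate
x, lo, hi = dkw_cdf_band(rewards)                  # band on the full reward CDF
```

The band on the full CDF is what makes derived quantities (expected value, quantiles, tail probabilities) boundable at once; tightening such bounds with as few rollouts as possible is the focus of the paper.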
Submission Number: 20