Type: An evaluation-centric paper (focused on advancing methods and ideas for robot evaluation)
Keywords: Policy Comparison, Imitation Learning, Statistical Evaluation, Sequential Testing
TL;DR: A novel statistical method for rigorously comparing imitation learning policies via a sequential procedure that preserves probabilistic correctness while requiring a near-minimal number of evaluation trials.
Abstract: Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new policies are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials, which is a costly procedure. This paper proposes a novel statistical framework for rigorously comparing two policies in the small-sample-size regime. Prior work on statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and lacks flexibility in adapting the sample size to the observed evaluation data. Furthermore, extending such a test with additional trials risks inadvertent p-hacking, undermining its statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether or not to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulations and real-world robot manipulation experiments show that our test achieves near-optimal stopping: researchers can stop evaluation and reach a decision after a near-minimal number of trials while preserving the probabilistic correctness and statistical power of the comparison.
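To illustrate the kind of sequential procedure the abstract contrasts with batch testing, the sketch below implements a generic anytime-valid sequential sign test based on a test martingale (a betting-style wealth process). This is an assumed, minimal example for intuition only, not the paper's actual method; the function name, the fixed bet size `lam`, and the simulated success rates are all hypothetical choices.

```python
import numpy as np

def sequential_sign_test(outcomes_a, outcomes_b, alpha=0.05, lam=0.5):
    """Illustrative anytime-valid sequential sign test (test-martingale style).

    outcomes_a, outcomes_b: paired binary success indicators (1 = success)
    for two policies. On trials where exactly one policy succeeds, H0 says
    each policy is equally likely to be the one that succeeded. The wealth
    process is a supermartingale under H0, so rejecting as soon as wealth
    exceeds 1/alpha controls the error rate at any stopping time (Ville's
    inequality). NOTE: a generic sketch, not the paper's procedure.
    """
    wealth = 1.0
    for a, b in zip(outcomes_a, outcomes_b):
        if a == b:
            continue  # ties carry no sign information; skip them
        x = 1.0 if a > b else 0.0        # 1 if A succeeded where B failed
        wealth *= 1.0 + lam * (x - 0.5)  # fair bet under H0: P(x = 1) = 0.5
        if wealth >= 1.0 / alpha:
            return True, wealth          # stop early: evidence that A > B
    return False, wealth                 # not enough evidence yet; may continue

# Hypothetical usage with simulated trials (A succeeds 80%, B 50%):
rng = np.random.default_rng(0)
a = (rng.random(200) < 0.8).astype(int)
b = (rng.random(200) < 0.5).astype(int)
print(sequential_sign_test(a, b))
```

Because the wealth process is valid at any data-dependent stopping time, a researcher can peek after every trial and stop as soon as the threshold is crossed, which is exactly the flexibility that fixed-sample batch tests lack.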
Submission Number: 11