Reliably Augmenting Real-World Tests with Simulation for Scalable Robot Policy Evaluation

Published: 22 Nov 2025, Last Modified: 22 Nov 2025
Venue: SAFE-ROL Oral
License: CC BY 4.0
Keywords: Evaluation, Finite-Sample Statistical Inferences, Real2Sim
Abstract: As robot foundation models become increasingly capable of performing complex manipulation tasks in diverse environments, rigorous evaluation of these learned policies is crucial for assessing performance and guiding improvement. However, these policies are often evaluated on fewer than 50 real-world trials, making it difficult to assess their performance confidently on metrics such as success rate. We present a framework that augments real-world evaluations with large-scale simulations to provide stronger inferences on real-world policy performance, rather than scaling up real-world evaluations. Our pipeline yields confidence intervals that are non-asymptotically valid and saves up to 20% of hardware evaluation cost.
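To illustrate why fewer than 50 trials gives weak inferences on success rate, the sketch below computes a Clopper-Pearson interval, a standard exact (finite-sample) binomial confidence interval. This is not the framework described in the abstract, which additionally uses simulation data; it is only a real-trials-only baseline swapped in for illustration. The function names `binom_cdf`, `_bisect_decreasing`, and `clopper_pearson`, and the example counts, are assumptions made here.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Find the root of a function f that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial success rate."""
    k, n = successes, trials
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    if k == 0:
        lower = 0.0
    else:
        lower = _bisect_decreasing(
            lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
        )
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2.
    if k == n:
        upper = 1.0
    else:
        upper = _bisect_decreasing(lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


# Hypothetical example: 40 successes observed in 50 real-world trials.
low, high = clopper_pearson(40, 50)
print(f"95% interval for the success rate: [{low:.3f}, {high:.3f}]")
```

With these hypothetical counts the 95% interval is more than 0.2 wide, so two policies whose true success rates differ by ten percentage points could easily be indistinguishable from real-world trials alone.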
Submission Number: 28