Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators

Published: 26 May 2026, Last Modified: 27 May 2026, Real2Sim2Real, Everyone, CC BY 4.0
Reviewer: ~Apurva_Badithela2
Keywords: Real2Sim, Policy Evaluation, Rigorous statistical methods
TL;DR: We present a framework to combine large-scale simulation evaluations with small-scale hardware testing for reliable evaluation of real-world performance of robot manipulation policies.
Abstract: Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework that augments large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned $\pi_0$ on a joint distribution of objects and initial conditions, and find that our approach saves upwards of 20-25% of the hardware evaluation effort needed to achieve similar bounds on policy performance.
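To illustrate the bias-rectification idea in the abstract, the following is a minimal sketch of the standard prediction-powered point estimate of a mean, assuming binary success outcomes (1 = success, 0 = failure). It is not the SureSim implementation: the function name `ppi_mean_estimate`, its arguments, and the toy data are hypothetical, and the non-asymptotic confidence interval construction mentioned in the abstract is not shown.

```python
from statistics import fmean


def ppi_mean_estimate(real_paired, sim_paired, sim_only):
    """Prediction-powered point estimate of the real-world success rate.

    real_paired: real outcomes on the small paired set (hypothetical data)
    sim_paired:  simulation outcomes on the same paired conditions
    sim_only:    simulation outcomes on the large simulation-only set
    """
    if len(real_paired) != len(sim_paired):
        raise ValueError("paired real and simulation outcomes must align")
    if not real_paired or not sim_only:
        raise ValueError("both the paired set and the simulation set must be non-empty")

    # Rectifier: average real-minus-sim gap measured on the paired trials.
    rectifier = fmean(r - s for r, s in zip(real_paired, sim_paired))

    # Large-scale simulation mean, shifted by the estimated simulation bias.
    return fmean(sim_only) + rectifier


# Toy example: the simulator is optimistic on one of four paired trials.
real_paired = [1, 0, 1, 0]
sim_paired = [1, 1, 1, 0]
sim_only = [1, 1, 1, 0, 1, 1, 0, 1]

estimate = ppi_mean_estimate(real_paired, sim_paired, sim_only)
print(estimate)
```

In this toy data the simulation-only success rate is 0.75 and the paired trials show the simulator overestimating real success by 0.25 on average, so the rectified estimate is lower than the raw simulation mean. How much hardware effort this saves depends on how well simulation outcomes track real outcomes, which the sketch does not model.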
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 25