Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection

Published: 12 Jun 2025, Last Modified: 12 Jun 2025
Venue: RobotEvaluation@RSS 2025 Poster
License: CC BY 4.0
Type: An evaluation-centric paper (focused on advancing methods and ideas for robot evaluation)
Keywords: efficient evaluation, active testing, manipulation
TL;DR: We introduce a framework to model the distribution of robot performance across tasks and policies during evaluation, which enables us to actively select informative experiments in a cost-aware manner.
Abstract: Evaluating learned robot control policies to determine their performance costs the experimenter time and effort. As robots become capable of accomplishing more diverse tasks, evaluating across all of these tasks becomes harder, since it is impractical to test every policy on every task multiple times. Rather than considering only the average performance of a policy on a task, we consider the distribution of performance over time. In a multi-task policy evaluation setting, we actively model the distribution of robot performance across multiple tasks and policies as we sequentially execute experiments. We show that natural language is a useful prior for modeling relationships between tasks, since tasks with similar descriptions often exhibit related policy behavior. We leverage this formulation to reduce experimenter effort by using a cost-aware information-gain heuristic to efficiently select informative trials. In experiments on existing evaluation data from real robots and simulations, we find a 50% reduction in the error of mean performance estimates given a fixed cost budget. We encourage the use of our surrogate model as a scalable approach to track progress in evaluation.
Submission Number: 5
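
The abstract describes three ingredients: a surrogate model of per-task performance, a language prior linking related tasks, and cost-aware active selection of trials. The sketch below is a minimal, hypothetical illustration of that loop, not the paper's implementation: task similarity is computed from stand-in random "embeddings" (a real system would use a pretrained text encoder), the surrogate is a similarity-pooled Beta posterior over success rates, and posterior variance per unit cost stands in for the paper's information-gain heuristic. All task names, costs, and numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation pool: 2 policies x 4 tasks, with per-trial costs.
tasks = ["pick up the red block", "pick up the blue block",
         "open the top drawer", "close the top drawer"]
policies = ["policy_A", "policy_B"]
costs = np.array([1.0, 1.0, 3.0, 3.0])  # drawer trials are pricier to reset

# Stand-in for language embeddings of the task descriptions; in practice
# these would come from a pretrained text encoder.
emb = rng.normal(size=(len(tasks), 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = np.clip(emb @ emb.T, 0.0, None)  # nonnegative cosine-similarity weights

def posterior(counts):
    """Beta posterior over success rate per (policy, task), with pseudo-counts
    pooled from similar tasks -- a crude way to share information across
    related tasks."""
    succ, fail = counts[..., 0], counts[..., 1]
    a = 1.0 + succ @ sim  # pooled pseudo-successes (diagonal of sim is 1)
    b = 1.0 + fail @ sim  # pooled pseudo-failures
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

def select_next(counts):
    """Cost-aware heuristic: pick the (policy, task) trial with the highest
    posterior variance per unit cost, a proxy for information gain per cost."""
    _, var = posterior(counts)
    score = var / costs[None, :]
    return np.unravel_index(np.argmax(score), score.shape)

# Simulated evaluation loop under a fixed cost budget.
true_p = rng.uniform(0.2, 0.9, size=(len(policies), len(tasks)))  # unknown truth
counts = np.zeros((len(policies), len(tasks), 2))
budget, spent = 30.0, 0.0
while True:
    p_i, t_i = select_next(counts)
    if spent + costs[t_i] > budget:
        break
    success = rng.random() < true_p[p_i, t_i]  # run one (simulated) trial
    counts[p_i, t_i, 0 if success else 1] += 1
    spent += costs[t_i]

mean, _ = posterior(counts)
print("estimated success rates:\n", np.round(mean, 2))
print("true success rates:\n", np.round(true_p, 2))
```

Because the score divides uncertainty by cost, cheap tasks are probed first and expensive ones only once their remaining uncertainty justifies the expense, which is the qualitative behavior a cost-aware selection heuristic is meant to produce.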