Keywords: offline imitation learning, behavior cloning, data attribution, data curation, demonstration pruning, TRAK, robot learning, offline-to-online, evaluation protocols, RoboMimic
TL;DR: A pruning method can beat matched random pruning and still lose to a cheap baseline. We propose a four-gate offline-to-online audit; TRAK-Traj passes gate 1 (paired random comparison) on curated RoboMimic Can MH, but a trajectory-length baseline is stronger.
Abstract: Pruning offline demonstrations is a deployment decision: a scoring rule can beat random pruning while still losing to a cheap baseline that would be the better online choice. We present a four-gate offline-to-online audit for pruning methods: paired random comparison, cheap-baseline audit, held-out stress test, and mechanism check. As a case study, TRAK-Traj, a trajectory-level adaptation of TRAK-style attribution, beats matched random pruning on curated RoboMimic Can MH by +4.7 percentage points across 10 paired seeds and 300 rollouts per condition (9/10 wins; paired t(9) = 2.43, one-sided p = 0.019). But the audit changes the deployment recommendation: TracIn-style scoring ties TRAK, trajectory length is stronger on curated Can, and a held-out mixed-quality block shows length outperforming both TRAK and random. Mechanism analysis explains why: length pruning removes 260 worse-tier trajectories out of 270 pruned in the mixed-quality split. The contribution is a reproducible audit template for data-curation claims before robot deployment.
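The summary statistics quoted in the abstract can be sanity-checked from the reported numbers alone. The sketch below is not the authors' analysis code. It recomputes the one-sided p-value from the reported t(9) = 2.43 by integrating the Student-t density numerically, and the share of worse-tier trajectories among those removed by length pruning. The helper names `t_pdf` and `t_sf` are hypothetical, and no per-seed data is assumed.

```python
import math


def t_pdf(x, df):
    """Density of the Student-t distribution with `df` degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) for t >= 0.

    Uses composite Simpson's rule on [0, t] (steps must be even) and the
    symmetry of the t distribution, which puts mass 0.5 above zero.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Reported in the abstract: paired t(9) = 2.43 across 10 paired seeds.
p_one_sided = t_sf(2.43, 9)

# Reported in the abstract: 260 of the 270 trajectories removed by
# length pruning in the mixed-quality split are worse-tier.
worse_tier_share = 260 / 270

print(f"one-sided p = {p_one_sided:.4f}")
print(f"worse-tier share of pruned trajectories = {worse_tier_share:.1%}")
```

With per-seed success rates in hand, a library routine for a paired t-test would be the more natural choice. The numerical integral is used here only to keep the check free of dependencies.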
Submission Number: 42