Keywords: robot learning, shortcut learning, OOD generalization, data augmentation, generative models
TL;DR: We introduce CIFT, a framework that pairs our multi-view synthesis engine, MVAug, with a novel Information Fidelity metric to optimally compose augmented data, mitigating shortcut learning in robot policies.
Abstract: Generalist robot policies trained on large-scale, visually homogeneous datasets can be susceptible to shortcut learning, which impairs their out-of-distribution (OOD) generalization. While generative data augmentation is a common approach to introducing diversity, it presents a subtle challenge: data composition. Naively mixing real and synthetic data can corrupt the learning signal, because this process often prioritizes visual diversity at the expense of information fidelity. This paper argues that robust generalization depends on principled, fidelity-aware data composition. We introduce Coherent Information Fidelity Tuning (CIFT), a framework that treats data composition as an optimization problem. CIFT uses a practical proxy for Information Fidelity based on the feature-space geometry of a dataset. This proxy enables the identification of a phase transition, termed the Decoherence Point, at which training stability degrades. The framework includes a generative engine, Multi-View Video Augmentation (MVAug), which synthesizes a causally disentangled data spectrum for this tuning process. Applying CIFT to policy architectures such as $\pi_0$ and GE-Act improves OOD success rates by over 54%.
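The abstract does not specify how the feature-space geometry proxy is computed. As a minimal illustrative sketch (not the authors' actual metric), one could summarize a dataset's geometry with the effective rank of its feature covariance and track how that summary shifts as the synthetic-data ratio grows; the function names `effective_rank` and `fidelity_proxy` below are hypothetical:

```python
import numpy as np

def effective_rank(features):
    # Effective rank of the feature covariance: a simple scalar
    # summary of how spread-out a dataset is in feature space.
    cov = np.cov(features, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    p = eig / eig.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def fidelity_proxy(real_feats, synth_feats, ratio, seed=0):
    # Mix synthetic features into the real set at the given ratio and
    # compare the geometry of the mixture to the real data alone.
    # ratio=0.0 reproduces the real set, so the proxy equals 1.0 there.
    k = int(ratio * len(real_feats))
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(synth_feats), size=k, replace=False)
    mix = np.vstack([real_feats, synth_feats[idx]])
    return effective_rank(mix) / effective_rank(real_feats)
```

Sweeping `ratio` over a grid and watching for an abrupt change in this proxy would be one plausible way to locate something like the Decoherence Point described above.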
The datasets used in this study are available in the anonymous repository linked below. All model checkpoints will be released in a public repository after the review process to facilitate reproducibility. The anonymous code repository is available at: https://anonymous.4open.science/r/CIFT-code.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 10739