Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Published: 01 Jan 2024, Last Modified: 17 Feb 2025. Venue: ICRA 2024. License: CC BY-SA 4.0
Abstract: Diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively denoises action sequences from random noise conditioned on the input states, and the model is typically trained with a single diffusion loss. This paper explores how such models can be improved when the denoising process is informed by a better visual representation. We study the scenario where the model is jointly optimized using the standard diffusion loss alongside an auxiliary objective based on self-supervised learning. After experimenting with various objectives, we introduce Crossway Diffusion, a simple yet effective way to enhance diffusion-based visuomotor policy learning via a state decoder and an auxiliary reconstruction objective. During training, the state decoder reconstructs raw image pixels and other states from the intermediate representations of the model. Experiments on various simulated and real-world tasks demonstrate the effectiveness of our method, confirming its consistent advantages over the standard diffusion-based policy and other baselines.
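The joint objective described in the abstract can be illustrated with a minimal sketch. It assumes the diffusion loss takes the common noise-prediction mean-squared-error form and that the two terms are combined as a weighted sum; the function names and the `aux_weight` parameter are hypothetical and are not taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    assert len(a) == len(b), "inputs must have the same length"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(pred_noise, true_noise, recon_state, true_state, aux_weight=0.1):
    """Diffusion loss plus a weighted auxiliary reconstruction loss.

    pred_noise / true_noise: predicted and sampled noise on the action sequence
        (assumed noise-prediction form of the diffusion loss).
    recon_state / true_state: state-decoder output and the original state
        (e.g. flattened image pixels and other states).
    aux_weight: hypothetical weighting hyperparameter, an assumption here.
    """
    diffusion = mse(pred_noise, true_noise)
    reconstruction = mse(recon_state, true_state)
    total = diffusion + aux_weight * reconstruction
    return total, diffusion, reconstruction


# Toy usage with made-up numbers, only to show the call shape.
total, diffusion, reconstruction = joint_loss(
    pred_noise=[0.0, 1.0],
    true_noise=[0.0, 0.0],
    recon_state=[1.0, 1.0],
    true_state=[1.0, 3.0],
    aux_weight=0.5,
)
print(total, diffusion, reconstruction)
```

In this sketch the reconstruction term only affects training; consistent with the abstract's statement that the state decoder reconstructs states "during training", nothing here would be needed when sampling actions.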