Keywords: Hybrid soft actor-critic, online diffusion policy learning, maximum entropy diffusion policy, non-prehensile manipulation
Abstract: Learning diverse policies for non-prehensile manipulation of objects can potentially improve skill transfer and generalization to out-of-distribution scenarios and unseen objects. In this work, we propose a novel approach to learning versatile 6D non-prehensile manipulation policies by introducing a new objective function based on entropy maximization terms. This allows for simultaneous exploration of discrete and continuous action spaces, such as contact location and motion parameter spaces. To further enhance the diversity of the agent's policy, we represent the continuous motion parameter policy as a diffusion model and derive the maximum entropy objective for optimizing diffusion policies as the lower bound of the maximum reward likelihood using structured variational inference. As a result, we introduce the hybrid soft actor-critic with diffusion policy algorithm (Diff-HySAC). We evaluate the benefit of adding maximum entropy regularization and diffusion in both simulation and zero-shot sim2real tasks. The results show that this combination helps learn more diverse behavior policies. In zero-shot sim2real transfer, the improvement is larger, with the success rate increasing from 53\% to 72\% on the 6D object pose alignment task.
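For context, a minimal sketch of the kind of hybrid maximum-entropy objective the abstract describes, with entropy terms over both the discrete contact-location policy and the continuous motion-parameter policy; the symbols $\pi_d$, $\pi_c$, $\alpha_d$, $\alpha_c$ are assumed notation for illustration, not taken from the paper:
$J(\pi_d, \pi_c) = \mathbb{E}\Big[\sum_t r(s_t, a_t^{d}, a_t^{c}) + \alpha_d\, \mathcal{H}\big(\pi_d(\cdot \mid s_t)\big) + \alpha_c\, \mathcal{H}\big(\pi_c(\cdot \mid s_t, a_t^{d})\big)\Big],$
where $a_t^{d}$ is the discrete contact-location action, $a_t^{c}$ the continuous motion parameters, and $\alpha_d$, $\alpha_c$ the respective entropy temperatures.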
Supplementary Material: zip
Spotlight Video: zip
Submission Number: 29