Equivariant Diffusion Model With A5-Group Neurons for Joint Pose Estimation and Shape Reconstruction

Boyan Wan, Yifei Shi, Xiaohong Chen, Kai Xu

Published: 2025, Last Modified: 25 Aug 2025 | IEEE Trans. Pattern Anal. Mach. Intell. 2025 | CC BY-SA 4.0

Abstract: Object pose estimation and shape reconstruction are inherently coupled tasks, yet most existing approaches study them separately. A few recent works address joint pose estimation and shape reconstruction, but they struggle with partial observations and shape ambiguities. An open challenge in this area is to design a mechanism that lets the two tasks benefit each other, boosting the performance and robustness of both. In this work, we advocate the use of diffusion models for joint estimation of category-level object poses and reconstruction of object geometry. Diffusion models formulate shape reconstruction as a generation process conditioned on input observations. This formulation has two main advantages. First, the iterative inference of diffusion models provides a mechanism for iteratively optimizing both pose estimation and shape reconstruction. Second, diffusion models can produce multiple outputs from different input noises, which addresses the ambiguity caused by partial observations. To achieve this, we propose an equivariant diffusion model for joint pose estimation and shape reconstruction. The approach consists of an equivariant feature extractor that aggregates features of the input point cloud and a ShapePose diffusion model that generates object pose and shape simultaneously. To avoid training the model on all possible shape poses in SO(3), we propose to augment the diffusion model with A5-group neurons, where the neurons are converted into 5D vectors and can be rotated with the alternating group A5. Based on the A5-group neurons, we implement SO(3)-equivariant 3D point convolution and SO(3)-equivariant concatenation, making the entire network SO(3)-equivariant. Moreover, to select the most plausible combination of pose and shape from the generated ones, we propose a geometry-based plausibility measure for an estimated pose paired with a reconstructed shape.
Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method achieves state-of-the-art results in shape reconstruction and pose estimation on two public datasets and a new dataset with stacked objects. In particular, we show that the proposed method can provide multiple plausible outputs under partial observations and shape ambiguities.
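
As a rough illustration of what an equivariance property over A5 means, the sketch below builds A5 as the 60 even permutations of five elements and checks that a toy layer commutes with the group action on 5D vectors. The abstract does not say how the A5-group neurons are constructed or how A5 is related to SO(3) rotations in the network, so everything here is an assumption for illustration: the permutation action `act`, the layer `equivariant_linear`, and its coefficients are hypothetical and are not the authors' method. The check covers only the discrete group A5, not full SO(3).

```python
import itertools

import numpy as np


def is_even(perm):
    """Return True if the permutation has an even number of inversions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


# A5: the even permutations of five elements (60 in total).
A5 = [p for p in itertools.permutations(range(5)) if is_even(p)]


def act(perm, x):
    """Hypothetical group action: permute the 5 components of each neuron."""
    return x[..., list(perm)]


def equivariant_linear(x, a, b):
    """Toy layer (an assumption, not the paper's): a*x + b*mean(x) per neuron."""
    return a * x + b * x.mean(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))  # 8 neurons, each stored as a 5D vector

# Equivariance: applying g before the layer should match applying g after it.
max_err = max(
    np.abs(
        equivariant_linear(act(g, x), 0.7, -0.3)
        - act(g, equivariant_linear(x, 0.7, -0.3))
    ).max()
    for g in A5
)
print(len(A5), max_err < 1e-12)
```

The toy layer is equivariant because scaling each component and adding the per-neuron mean are both unaffected by reordering components. A layer with an unconstrained 5x5 weight matrix would generally fail this check.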

External IDs: dblp:journals/pami/WanSCX25