Scaling 3D Compositional Models for Robust Classification and Pose Estimation

27 Sept 2024 (modified: 15 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: analysis by synthesis, image classification, 3D representation, compositional models
Abstract: Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. Human vision, however, is typically much more robust to all these factors, arguably because it exploits 3D object representations that are invariant to most of them. Recently, a class of 3D compositional models has been developed in which objects are represented as 3D meshes, typically with 1000 vertices, each associated with a learnt feature vector. These models have shown robustness in small-scale settings involving 10 or 12 objects, but it has been unclear whether they can scale to hundreds of object classes. The main obstacle is that their training involves supervised contrastive learning on the mesh vertices representing the objects and requires each vertex to be contrasted with all other vertices, which scales quadratically with the number of vertices. A newly available dataset with 3D annotations for 188 object classes allows us to address this scaling challenge. We present a strategy that exploits the compositionality of the objects, i.e., the independence of the vertex feature vectors, which greatly reduces training time while also improving the performance of the algorithms. We first refactor the per-vertex contrastive learning into within-class and between-class contrast. We then propose a process that dynamically decouples the contrast between classes that are rarely confused and enhances the contrast between the vertices of the most frequently confused classes. Our large-scale 3D compositional model not only achieves state-of-the-art performance on object classification and 3D pose estimation in a unified manner, surpassing ViT and ResNet, but is also more robust under out-of-distribution testing, including occlusion, weather conditions, and synthetic data. This paves the way for scalable 3D object understanding and opens exciting possibilities for applications in robotics, autonomous systems, and augmented reality.
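The refactoring described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch implementation, not the authors' code: per-vertex contrastive learning is split into a within-class term and a between-class term, and between-class contrast is dynamically decoupled for class pairs the model rarely confuses while the most confused pairs are weighted more heavily. All names (compositional_contrastive_loss, confusion_threshold, tau) and the thresholding mechanism are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def compositional_contrastive_loss(vertex_features, confusion,
                                   tau=0.07, confusion_threshold=0.05):
    """Illustrative sketch of the refactored vertex contrastive loss.

    vertex_features: dict mapping class index -> (V_c, D) tensor of learnt
        vertex features for that class's mesh (contrasted directly against
        each other here; in practice the positives would come from image
        features at projected vertex locations).
    confusion: (C, C) tensor; confusion[i, j] estimates how often class i
        is currently mistaken for class j (updated during training).
    """
    classes = sorted(vertex_features.keys())
    within = between = 0.0

    for i in classes:
        f_i = F.normalize(vertex_features[i], dim=-1)       # (V_i, D)

        # Within-class term: keep the vertices of one mesh distinguishable
        # from each other (each vertex serves as its own positive).
        sim_ii = f_i @ f_i.t() / tau                        # (V_i, V_i)
        labels = torch.arange(f_i.shape[0])
        within = within + F.cross_entropy(sim_ii, labels)

        # Between-class term: contrast only against classes currently
        # confused with class i. Rarely confused pairs are decoupled, so
        # the cost scales with the number of confusable pairs rather than
        # quadratically with the total vertex count across all classes.
        for j in classes:
            if j == i or confusion[i, j] < confusion_threshold:
                continue
            f_j = F.normalize(vertex_features[j], dim=-1)   # (V_j, D)
            sim_ij = f_i @ f_j.t() / tau                    # (V_i, V_j)
            # Weight by the confusion rate: the most confused class
            # pairs receive the strongest repulsion.
            between = between + confusion[i, j] * torch.logsumexp(sim_ij, dim=1).mean()

    return within + between

# Toy usage: 4 classes, 1000 vertices each, 64-dim features.
features = {c: torch.randn(1000, 64, requires_grad=True) for c in range(4)}
confusion = torch.rand(4, 4) * 0.1
loss = compositional_contrastive_loss(features, confusion)
loss.backward()
```

Under a scheme like this, the between-class cost depends on how many class pairs are actively confused rather than on all pairs of the roughly 188 × 1000 vertices, which is consistent with the abstract's claim that decoupling greatly reduces training time.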
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10681