3D Object Representation Learning for Robust Classification and Pose Estimation

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: classification, 3D-pose estimation, analysis-by-synthesis, render-and-compare
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We demonstrate the enhanced robustness of 3D object representations for classification and pose estimation.
Abstract: In this work, we pioneer a framework for 3D object representation learning that achieves exceptionally robust classification and pose estimation results. In particular, we introduce a 3D representation of object categories using a 3D template mesh composed of feature vectors at each mesh vertex. Our model predicts, for each pixel in a 2D image, a feature vector of the corresponding vertex in each category template mesh, hence establishing dense correspondences between image pixels and the 3D template geometry of all target object categories. The feature vectors on the mesh vertices are trained to be viewpoint invariant by leveraging the associated camera poses. During inference, we efficiently estimate the object class and pose by matching the class-specific templates to a target feature map in a two-step process: First, we classify the image by matching the vertex features of each template to the input feature map. Interestingly, we found that image classification can be performed using only the vertex features, without requiring the 3D mesh geometry, which makes class label inference very efficient. In a second step, the object pose is inferred using a render-and-compare matching process that ensures spatial consistency between the detected vertices. Our experiments on image classification demonstrate that our proposed 3D object representation has a number of profound advantages over classical image-based representations. First, it is exceptionally robust to a range of real-world and synthetic out-of-distribution shifts while performing on par with state-of-the-art architectures on in-distribution data in terms of accuracy and speed. Second, the estimated object pose is competitive with baseline models that were explicitly designed for pose estimation but cannot classify images. Finally, we show that our model offers enhanced interpretability, which we illustrate by visualizing individual vertex matches, and that it performs classification and pose estimation jointly and consistently.
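The classification step described in the abstract can be illustrated with a minimal sketch (not the authors' code): per-class vertex features are matched against the predicted 2D feature map, each vertex votes with its best-matching pixel, and the votes are averaged per class. The tensor shapes, the use of cosine similarity with max-pooling over pixels, and the function name classify_by_vertex_matching are assumptions made purely for illustration.

import torch
import torch.nn.functional as F

def classify_by_vertex_matching(feature_map, vertex_features):
    """Hypothetical sketch of geometry-free classification via vertex-feature matching.

    feature_map:     (C, H, W) feature map predicted for the input image.
    vertex_features: list of (N_k, C) tensors, one per category k, holding the
                     learned viewpoint-invariant features of the template vertices.
    Returns the index of the best-matching category.
    """
    C, H, W = feature_map.shape
    pixels = F.normalize(feature_map.reshape(C, -1), dim=0)   # (C, H*W) unit-norm pixel features
    scores = []
    for verts in vertex_features:
        verts = F.normalize(verts, dim=1)                     # (N_k, C) unit-norm vertex features
        sim = verts @ pixels                                  # (N_k, H*W) cosine similarities
        # each vertex votes with its best-matching pixel; average the votes over vertices
        scores.append(sim.max(dim=1).values.mean())
    return int(torch.argmax(torch.stack(scores)))

Under these assumptions, pose would then be recovered in a second step by a render-and-compare procedure over the selected category's template, which the sketch deliberately omits.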
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1954