TL;DR: A self-supervised capsule architecture that canonicalizes data while simultaneously decomposing point clouds into parts to perform unsupervised representation learning.
Abstract: We propose an unsupervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equivariance properties. This not only enables the training of a semantically consistent decomposition, but also allows us to learn a canonicalization operation that enables object-centric reasoning. To train our neural network we require neither classification labels nor manually-aligned training datasets. Yet, by learning an object-centric representation in a self-supervised manner, our method outperforms the state-of-the-art on 3D point cloud reconstruction, canonicalization, and unsupervised classification.
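The keypoint aggregation described in the abstract can be illustrated with a minimal sketch. It is not the released implementation. It assumes that each keypoint is the attention-weighted centroid of the input points, that the attention mask is a softmax over capsules for each point, and that the attention values do not change when the object is rotated (the invariance property the training is meant to encourage). The names `keypoints_from_attention` and `rotate_z` are hypothetical, and random logits stand in for the network output.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one point's capsule logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def keypoints_from_attention(points, logits):
    """points: N x 3 list, logits: N x K list -> K keypoints (K x 3 list).

    Assumption: keypoint k is the attention-weighted centroid of the points.
    """
    masks = [softmax(row) for row in logits]  # each point spreads attention over K capsules
    num_caps = len(logits[0])
    keypoints = []
    for k in range(num_caps):
        weight = sum(m[k] for m in masks)  # strictly positive because of the softmax
        keypoints.append([
            sum(m[k] * p[d] for m, p in zip(masks, points)) / weight
            for d in range(3)
        ])
    return keypoints

def rotate_z(points, angle):
    # Rotate every 3D point about the z-axis by `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y, z] for x, y, z in points]

rng = random.Random(0)
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(64)]
logits = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]  # stand-in for network output
angle = 0.7

# Equivariance check: rotating the keypoints equals recomputing them on rotated points,
# provided the attention (here: the same logits) is unchanged by the rotation.
kp_then_rotate = rotate_z(keypoints_from_attention(points, logits), angle)
rotate_then_kp = keypoints_from_attention(rotate_z(points, angle), logits)
max_gap = max(
    abs(a - b)
    for u, v in zip(kp_then_rotate, rotate_then_kp)
    for a, b in zip(u, v)
)
```

Under these assumptions `max_gap` is at floating-point round-off level, because a weighted average commutes with any rotation. In the actual method the attention comes from a learned network, so this property has to be encouraged by the self-supervised training on rotated pairs rather than holding by construction.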
Architectural details, additional ablation studies, and qualitative results for the aligned setup are available in the supplementary appendix. For more details, click the image below to access the PDF.
We provide [code in the accompanying subfolder]. Please see [README.md] for detailed instructions regarding the code.
Below are example videos demonstrating the quality of canonicalization. Our method achieves more stable canonicalization than Compass, as shown by the chairs and airplanes remaining well-aligned despite appearance changes.
Input | Aligned by Ours | Aligned by Compass | Input | Aligned by Ours | Aligned by Compass |
We show qualitative highlights, where we decompose 3D point clouds and auto-encode them using Canonical Capsules. We color each Canonical Capsule with a unique color, and similarly color "patches" from the reconstruction heads of 3D-PointCapsNet and AtlasNetV2. Canonical Capsules provide a semantically consistent decomposition that is aligned in the canonical frame, leading to improved reconstruction quality and unsupervised classification performance.
Results with the single-category Canonical Capsules
Input | Decomposition | Our reconstruction in canonical frame (not a still image!) | Our reconstruction in input frame | 3D-PointCapsNet reconstruction | AtlasNetV2 reconstruction |
Results with the multi-category Canonical Capsules
Input | Decomposition | Our reconstruction in canonical frame | Our reconstruction in input frame | 3D-PointCapsNet reconstruction | AtlasNetV2 reconstruction |
The supplementary videos are encoded with FFmpeg using the H.264 codec. If you cannot play the videos, please download the VLC player from http://www.videolan.org/vlc/index.html