Stacked Capsule Autoencoders

Adam Kosiorek, Sara Sabour, Yee Whye Teh, Geoffrey Hinton

06 Sept 2019 (modified: 05 May 2023), NeurIPS 2019
Abstract: Any object can be seen as a geometrically organized set of interrelated parts. Capsule networks model object parts explicitly and use them to predict whole objects. Typically, capsules are trained discriminatively: an object is assumed to exist if it is predicted by several parts at the same time, while an iterative inference procedure ensures that a single part is not assigned to multiple objects. In this unsupervised version, we devise a two-stage stacked autoencoder: the first stage segments images into parts and their poses, while the second stage organizes those parts into objects and their poses using a neural network encoder that takes the already-discovered parts as input. The top stage is trained by reconstructing part poses under a mixture of predictions made by different objects, where we ensure that each part is predicted by exactly one object. The bottom stage reconstructs an image as a mixture of the discovered parts. The top-level decoder predicts the poses of the parts by applying explicitly parametrized affine transformations to the object pose parameters. These coordinate transformations do not depend on viewpoint, so learning them acquires viewpoint-invariant knowledge in a statistically efficient manner. We learn objects and their parts on unlabeled data, and, once told the names of the learned classes, we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and near state-of-the-art results on MNIST (98.5%).
Code Link: https://github.com/google-research/google-research/stacked_capsule_autoencoders
CMT Num: 8996
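
The abstract's central mechanism, predicting part poses by composing an object's pose with learned object-part relations, can be illustrated with a short sketch. What follows is a minimal NumPy illustration, not the paper's implementation (see the code link above); it assumes 4-parameter similarity transforms (translation, scale, rotation) rather than the full affine transforms the paper uses, and all names (pose_to_matrix, predict_part_poses) and numeric values are hypothetical.

    import numpy as np

    def pose_to_matrix(pose):
        """Turn a (tx, ty, scale, theta) pose vector into a 3x3 homogeneous
        similarity transform. The paper uses full affine transforms; the
        restriction to similarities only keeps the sketch short."""
        tx, ty, s, theta = pose
        c, sn = np.cos(theta), np.sin(theta)
        return np.array([[s * c, -s * sn, tx],
                         [s * sn,  s * c, ty],
                         [0.0,     0.0,   1.0]])

    def predict_part_poses(object_pose, object_part_relations):
        """Compose the object's pose OV with each learned object-part
        relation OP to predict a part pose OV @ OP. OP is independent of
        viewpoint: changing the viewpoint changes only OV."""
        ov = pose_to_matrix(object_pose)
        return [ov @ op for op in object_part_relations]

    # Two parts with fixed poses relative to their object (learned in the
    # model; hard-coded here as hypothetical values).
    op_relations = [pose_to_matrix([ 0.5, 0.0, 1.0, 0.0]),
                    pose_to_matrix([-0.5, 0.0, 1.0, 0.0])]

    # The same object seen under two different viewpoints: the predicted
    # part poses move coherently while op_relations stay untouched.
    for obj_pose in ([0.0, 0.0, 1.0, 0.0],         # canonical viewpoint
                     [1.0, 2.0, 1.0, np.pi / 2]):  # shifted and rotated
        parts = predict_part_poses(obj_pose, op_relations)
        print([tuple(np.round(p[:2, 2], 2)) for p in parts])  # part translations

This viewpoint-invariance of the object-part matrices is what the abstract means by coordinate transformations that do not depend on viewpoint: once learned from one viewpoint, they predict a consistent part arrangement under any other.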