Keywords: equivariance, disentangle, generative model, transform invariance, recurrent neural networks
TL;DR: Visual scene understanding, like "what" and "where" are certain objects, can be posed as a factorization problem and solved by a new kind of recurrent neural network.
Abstract: Inferring the position of objects and their rigid transformations is still an open problem in visual scene understanding. Here we propose a neuromorphic framework that poses scene understanding as a factorization problem and uses a resonator network to extract object identities and their transformations. The framework uses vector binding operations to produce generative image models in which binding acts as the equivariant operation for geometric transformations. A scene can therefore be described as a sum of vector products, which in turn can be efficiently factorized by a resonator network to infer objects and their poses. We also describe a hierarchical resonator network that enables the definition of a partitioned architecture in which vector binding is equivariant for horizontal and vertical translation within one partition, and for rotation and scaling within the other partition. We demonstrate our approach using synthetic scenes composed of simple 2D shapes undergoing rigid geometric transformations and color changes.