Abstract: Filters in a convolutional network are typically parametrized in a pixel basis. As an orthonormal basis, pixels may represent any vector in R^n. In this paper, we relax the orthonormality requirement and extend the set of viable bases to the generalized notion of frames. Applying suitable frame bases to ResNets on CIFAR-10+, we demonstrate improved error rates by substitution alone. By exploiting the transformation properties of such generalized bases, we arrive at steerable frames, which allow CNN filters to be transformed continuously under arbitrary Lie groups and thereby to locally separate pose from canonical appearance. We implement this in the Dynamic Steerable Frame Network, which dynamically estimates the filter transformations conditioned on its input. The resulting method is a hybrid of Dynamic Filter Networks and Spatial Transformer Networks that can be inserted into any convolutional architecture, as we illustrate with two examples. First, we compare the estimation properties of a Dynamic Steerable Frame Network with those of a Dynamic Filter Network on an edge-detection task, showing clear advantages of the derived steerable frames. Second, we insert the Dynamic Steerable Frame Network as a module into a convolutional LSTM for limited-data hand-gesture recognition from video, where it provides effective dynamic regularization and clearly outperforms Spatial Transformer Networks. Together, these results lay the foundations of frame-based convolutional networks and Dynamic Steerable Frame Networks and illustrate their advantages for continuously transforming features and data-efficient learning.
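To make the core idea concrete, below is a minimal sketch of a frame-parametrized convolution: filters are learned coefficients over a fixed set of frame atoms rather than free pixel values, and a steerable choice of atoms (here, Gaussian derivatives, an illustrative assumption; the abstract does not fix a particular frame) lets filters be rotated by recombining coefficients alone. The names `gaussian_derivative_atoms`, `FrameConv2d`, and `steer` are hypothetical, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_derivative_atoms(size=5, sigma=1.0):
    """Fixed frame of 2D Gaussian-derivative atoms (illustrative, non-orthogonal)."""
    r = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(r, r, indexing="ij")
    g = torch.exp(-(x**2 + y**2) / (2 * sigma**2))
    g = g / g.sum()
    # Zeroth-order atom plus the two first-order derivatives Gx, Gy.
    return torch.stack([g, -x / sigma**2 * g, -y / sigma**2 * g])  # (A, k, k)

class FrameConv2d(nn.Module):
    """Convolution whose filters are linear combinations of fixed frame atoms."""
    def __init__(self, in_ch, out_ch, atoms):
        super().__init__()
        self.register_buffer("atoms", atoms)  # fixed basis, not learned
        self.coeff = nn.Parameter(            # learned expansion coefficients
            0.1 * torch.randn(out_ch, in_ch, atoms.shape[0]))

    def forward(self, x):
        # Expand coefficients in the frame to obtain pixel-space filters.
        w = torch.einsum("oia,akl->oikl", self.coeff, self.atoms)
        return F.conv2d(x, w, padding=self.atoms.shape[-1] // 2)

def steer(coeff, theta):
    """Rotate all filters by theta through the coefficients alone:
    the first-derivative pair (Gx, Gy) transforms as a 2D rotation."""
    c, s = torch.cos(theta), torch.sin(theta)
    out = coeff.clone()
    out[..., 1] = c * coeff[..., 1] - s * coeff[..., 2]
    out[..., 2] = s * coeff[..., 1] + c * coeff[..., 2]
    return out

layer = FrameConv2d(3, 16, gaussian_derivative_atoms())
out = layer(torch.randn(1, 3, 32, 32))  # -> (1, 16, 32, 32)
rotated = steer(layer.coeff, torch.tensor(0.5))  # steered coefficients
```

In a Dynamic Steerable Frame Network, the steering parameter (theta above) would not be a constant but would itself be predicted from the input by a small network, which is what ties the approach to Dynamic Filter Networks and Spatial Transformer Networks.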
TL;DR: We introduce non-orthogonal, overcomplete bases (frames) for ConvNets and derive Dynamic Steerable Frame Networks, a hybrid of Dynamic Filter Networks and Spatial Transformer Networks.
Conflicts: uva.nl, kuleuven.be
Keywords: Computer vision, Deep learning