- Keywords: image classification, multi-scale processing, attention
- TL;DR: We propose a novel architecture that traverses an image pyramid in a top-down fashion, while it visits only the most informative regions along the way.
- Abstract: The available resolution in our visual world is extremely high, if not infinite. Existing CNNs can be applied in a fully convolutional way to images of arbitrary resolution, but as the size of the input increases, they can not capture contextual information. In addition, computational requirements scale linearly to the number of input pixels, and resources are allocated uniformly across the input, no matter how informative different image regions are. We attempt to address these problems by proposing a novel architecture that traverses an image pyramid in a top-down fashion, while it uses a hard attention mechanism to selectively process only the most informative image parts. We conduct experiments on MNIST and ImageNet datasets, and we show that our models can significantly outperform fully convolutional counterparts, when the resolution of the input is that big that the receptive field of the baselines can not adequately cover the objects of interest. Gains in performance come for less FLOPs, because of the selective processing that we follow. Furthermore, our attention mechanism makes our predictions more interpretable, and creates a trade-off between accuracy and complexity that can be tuned both during training and testing time.