- CMT id: 204
- Abstract: Recently there has been growing interest in building ``active'' visual object recognizers, as opposed to ``passive'' recognizers which classify a given static image into a predefined set of object categories. In this paper we propose to generalize recent end-to-end active visual recognizers into a controller-recognizer framework. In this framework, the controller interfaces with an external manipulator, while the recognizer classifies the visual input adjusted by the manipulator. We describe two recently proposed models, the recurrent attention model (Mnih et al., 2014) and the spatial transformer network (Jaderberg et al., 2015), as representative examples of controller-recognizer models. Based on this description we observe that most existing end-to-end controller-recognizers tightly couple the controller and recognizer. We consider whether this tight coupling is necessary, and try to answer this empirically by investigating a decoupled controller and recognizer. Our experiments revealed that it is not always necessary to tightly couple them, and that by decoupling the controller and recognizer, it may be possible to build a generic controller that is pretrained and works together with any subsequent recognizer.
- Conflicts: umontreal.ca, nyu.edu, ox.ac.uk