- Abstract: Deep learning techniques have emerged as the de facto method for solving many classification-based computer vision tasks. Each of these tasks requires multiple (often hundreds of) examples per category in order to learn an accurate classifier. More recently, complex visual reasoning tasks have been proposed to challenge this classification-based paradigm. Deep networks that succeed on the CLEVR task learn to combine information from multiple sub-systems rather than attempting to extract all necessary information in a single forward pass. We explore a similar setting that compares multi-class classification networks against query-based networks across a wide variety of attributes in a single image. We show that, at a fixed network capacity, query-based networks outperform traditional multi-class networks because they can focus on the information relevant to the current query. We also show that query networks learn faster than multi-class networks because their attribute-focused representation allows for more multi-modal flexibility per training iteration.
- Keywords: Attribute prediction, feature representation, visual question answering
- TL;DR: We show the effectiveness of query-based networks for learning image attributes compared to standard multi-class networks.
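The contrast between the two network types in the abstract can be sketched with a minimal NumPy example. This is an illustrative assumption, not the paper's actual architecture: a multi-class head predicts every attribute jointly from shared image features, while a hypothetical query-conditioned head gates the same features with a one-hot attribute query and predicts only the queried attribute. All dimensions and weight names below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_attrs, num_vals = 64, 10, 5  # hypothetical sizes

# Shared image features, e.g. from a CNN backbone (assumed, not the paper's model).
img_feat = rng.standard_normal(feat_dim)

# Multi-class head: one forward pass predicts all attributes jointly.
W_multi = rng.standard_normal((num_attrs * num_vals, feat_dim))
multi_logits = (W_multi @ img_feat).reshape(num_attrs, num_vals)

# Query-based head: a one-hot query selects which attribute to predict,
# so the network's capacity focuses on a single attribute per pass.
query = np.eye(num_attrs)[3]                       # ask about attribute 3
W_gate = rng.standard_normal((feat_dim, num_attrs))
gated = img_feat * np.tanh(W_gate @ query)         # query-conditioned features
W_out = rng.standard_normal((num_vals, feat_dim))
query_logits = W_out @ gated

print(multi_logits.shape)  # (10, 5): all attributes at once
print(query_logits.shape)  # (5,): only the queried attribute
```

The sketch shows the structural difference the abstract leans on: the multi-class head must encode all attributes in a single representation, whereas the query head re-uses its capacity for whichever attribute is asked about.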