An Analysis of Query-Based Approaches to Image Attribute Prediction

Aaron Walsman, Daniel Gordon, Dieter Fox

Feb 12, 2018 (modified: Feb 12, 2018) ICLR 2018 Workshop Submission
  • Abstract: Deep learning techniques have emerged as the de facto method for solving many classification-based computer vision tasks. Each of these tasks requires multiple (often hundreds of) examples of each category in order to learn accurate classifiers. More recently, complex visual reasoning tasks have been proposed to challenge this classification-based paradigm. Deep networks that succeed on the CLEVR task learn to combine information from multiple sub-systems rather than attempting to extract all necessary information in a single forward pass. We explore a similar setting that compares multi-class classification networks against query-based networks across a wide variety of attributes in a single image. We show that, given a fixed network capacity, query-based networks outperform traditional multi-class networks because they can focus on the information relevant to the current query. We also show that query networks learn faster than multi-class networks because a representation focused on specific attributes allows for more multi-modal flexibility per training iteration.
  • TL;DR: We show that query-based networks learn image attributes more effectively than standard multi-class networks.
  • Keywords: Attribute prediction, feature representation, visual question answering
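
The contrast drawn in the abstract between multi-class and query-based prediction can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' architecture: the layer sizes, the one-hot query encoding, the random untrained weights, and the names `multi_class_forward` and `query_forward` are all hypothetical. The multi-class network scores every attribute in one forward pass, while the query network receives the image features together with a query and scores only the queried attribute.

```python
import random

def init(n_in, n_out, rng):
    """Random weight matrix with n_in rows and n_out columns (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def linear(x, w):
    """Multiply vector x (length n_in) by weight matrix w (n_in x n_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def multi_class_forward(features, w_hidden, w_out):
    """Hypothetical multi-class network: one pass scores all attributes."""
    return linear(relu(linear(features, w_hidden)), w_out)

def query_forward(features, query_index, num_attributes, w_hidden, w_out):
    """Hypothetical query network: one pass scores only the queried attribute.

    The query is encoded as a one-hot vector and concatenated to the features
    (an assumed encoding chosen for simplicity).
    """
    one_hot = [1.0 if i == query_index else 0.0 for i in range(num_attributes)]
    return linear(relu(linear(features + one_hot, w_hidden)), w_out)[0]

rng = random.Random(0)
num_features, num_attributes, hidden = 8, 5, 4  # illustrative sizes
features = [rng.random() for _ in range(num_features)]

# Multi-class: a single forward pass yields one score per attribute.
mc_scores = multi_class_forward(
    features, init(num_features, hidden, rng), init(hidden, num_attributes, rng)
)

# Query-based: one forward pass per query, each yielding a single score.
q_w_hidden = init(num_features + num_attributes, hidden, rng)
q_w_out = init(hidden, 1, rng)
q_scores = [
    query_forward(features, q, num_attributes, q_w_hidden, q_w_out)
    for q in range(num_attributes)
]
```

In this sketch both networks share the same hidden width, which loosely mirrors the fixed-capacity comparison described in the abstract; the query network spends that capacity on one attribute per pass rather than on all of them at once.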