Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks

Junkyung Kim, Matthew Ricci, Thomas Serre

Feb 15, 2018 (modified: Feb 15, 2018) ICLR 2018 Conference Blind Submission
  • Abstract: The robust and efficient recognition of visual relations in images is a hallmark of biological vision. Here, we argue that, despite recent progress in visual recognition, modern machine vision algorithms are severely limited in their ability to learn visual relations. Through controlled experiments, we demonstrate that visual-relation problems strain convolutional neural networks (CNNs). The networks eventually break altogether when rote memorization becomes impossible, such as when the intra-class variability exceeds their capacity. We further show that another type of feedforward network, called a relational network (RN), which was shown to successfully solve seemingly difficult visual question answering (VQA) problems on the CLEVR datasets, suffers similar limitations. Motivated by the comparable success of biological vision, we argue that feedback mechanisms including working memory and attention are the key computational components underlying abstract visual reasoning.
  • TL;DR: Using a novel, controlled, visual-relation challenge, we show that same-different tasks critically strain the capacity of CNNs; we argue that visual relations can be better solved using attention-mnemonic strategies.
  • Keywords: Visual Relations, Visual Reasoning, SVRT, Attention, Working Memory, Convolutional Neural Network, Deep Learning, Relational Network
