Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: The robust and efficient recognition of visual relations in images is a hallmark of biological vision. Here, we argue that, despite recent progress in visual recognition, modern machine vision algorithms are severely limited in their ability to learn visual relations. Through controlled experiments, we demonstrate that visual-relation problems strain convolutional neural networks (CNNs). The networks eventually break altogether when rote memorization becomes impossible, such as when the intra-class variability exceeds their capacity. We further show that another class of feedforward networks, called relational networks (RNs), which were shown to successfully solve seemingly challenging visual question answering (VQA) problems on the CLEVR datasets, suffer from the same limitations. Motivated by the comparable success of biological vision, we argue that the incorporation of feedback mechanisms, including working memory and attention, will constitute a necessary step towards building machines that are capable of abstract visual reasoning.
  • TL;DR: Using a novel, controlled, visual-relation challenge, we show that same-different tasks critically strain the capacity of CNNs; we argue that visual relations can be better solved using attention-mnemonic strategies.
  • Keywords: Visual Relations, Visual Reasoning, SVRT, Attention, Working Memory, Convolutional Neural Network, Deep Learning, Relational Network