Abstract: This work presents basic research on convolutional neural networks that learn to predict explainable scene graphs from input images without external supervision during training. Existing approaches follow a fully supervised training paradigm and therefore require meticulous annotations. In contrast, we are the first to present a self-supervised approach based on a fully differentiable auto-encoder in which the bottleneck is the graph that corresponds to the input image. To demonstrate the unique conceptual properties of our graph auto-encoder, we apply it to an example task that performs simple rule-based shape classification using only the information in the graph, and we show that our approach successfully classifies shapes that are never seen during training. We report exploratory findings in which the presented approach is applied to elementary line drawings depicting single shapes of limited complexity. We show that our approach performs comparably to a fully supervised graph parser baseline and generalizes significantly better than a conventional image classifier. Although extensive future research is needed to bring our approach to complex natural images, we believe it is a valuable conceptual step toward bridging deep neural networks and graph-based symbolic knowledge representations.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jonathon_Shlens1
Submission Number: 230