Generating Interpretable Images with Controllable Structure

Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, Nando de Freitas

Nov 04, 2016 (modified: Jan 18, 2017). ICLR 2017 conference submission.
  • Abstract: We demonstrate improved text-to-image synthesis with controllable object locations using an extension of Pixel Convolutional Neural Networks (PixelCNN). In addition to conditioning on text, we show how the model can generate images conditioned on part keypoints and segmentation masks. The character-level text encoder and image generation network are jointly trained end-to-end via maximum likelihood. We establish quantitative baselines in terms of text- and structure-conditional pixel log-likelihood for three data sets: Caltech-UCSD Birds (CUB), MPII Human Pose (MHP), and Common Objects in Context (MS-COCO).
  • TL;DR: Autoregressive text-to-image synthesis with controllable spatial structure.
  • Keywords: Deep learning, Computer vision, Multi-modal learning, Natural language processing
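
The abstract describes an autoregressive image model in the PixelCNN family. As a rough illustration of what "autoregressive" means at the convolution level, the sketch below builds the kind of causal kernel mask used in PixelCNN-style masked convolutions, so that each pixel depends only on pixels that precede it in raster order. This is a minimal sketch and not the authors' implementation: the function name `causal_mask` and its parameters are hypothetical, and it leaves out color-channel ordering, any gating, and the conditioning on text, keypoints, and segmentation masks.

```python
def causal_mask(kernel_size, include_center):
    """Return a kernel_size x kernel_size 0/1 mask for a masked convolution.

    Kept positions (1) are those strictly before the kernel center in
    raster order: every row above the center, plus the part of the center
    row to the left of the center. The center itself is kept only when
    include_center is True (hypothetical flag; a first layer would hide
    the current pixel, later layers may see the current position).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    c = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            before_center = row < c or (row == c and col < c)
            at_center = row == c and col == c
            if before_center or (include_center and at_center):
                mask[row][col] = 1
    return mask


if __name__ == "__main__":
    # Print both mask variants for a 3x3 kernel.
    for flag in (False, True):
        for mask_row in causal_mask(3, flag):
            print(mask_row)
        print()
```

Multiplying a convolution kernel elementwise by such a mask before applying it is one common way to enforce the raster-order dependency; a conditional model would additionally feed the text and spatial-structure encodings into each layer.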