Generating Interpretable Images with Controllable Structure

Scott Reed; Aäron van den Oord; Nal Kalchbrenner; Victor Bapst; Matt Botvinick; Nando de Freitas

13 Jul 2025 (modified: 21 Jul 2022) | ICLR 2017 Invite to Workshop | Readers: Everyone

Abstract: We demonstrate improved text-to-image synthesis with controllable object locations using an extension of Pixel Convolutional Neural Networks (PixelCNN). In addition to conditioning on text, we show how the model can generate images conditioned on part keypoints and segmentation masks. The character-level text encoder and image generation network are jointly trained end-to-end via maximum likelihood. We establish quantitative baselines in terms of text and structure-conditional pixel log-likelihood for three data sets: Caltech-UCSD Birds (CUB), MPII Human Pose (MHP), and Common Objects in Context (MS-COCO).
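
As a rough sketch of the objective the abstract describes (the notation is assumed here, not taken from the paper): write x for the image pixels in raster order, t for the text, and s for the structure input (part keypoints or a segmentation mask). An autoregressive model of the PixelCNN kind factorizes the conditional likelihood pixel by pixel, and maximum-likelihood training maximizes its log over the data set D. The reported text- and structure-conditional pixel log-likelihood would then correspond to the log of the first expression.

```latex
% Assumed notation: x = (x_1, ..., x_N) pixels, t = caption, s = keypoints or mask
\[
p_\theta(x \mid t, s) = \prod_{i=1}^{N} p_\theta(x_i \mid x_1, \dots, x_{i-1}, t, s),
\qquad
\theta^\star = \arg\max_\theta \sum_{(x,t,s) \in \mathcal{D}} \log p_\theta(x \mid t, s).
\]
```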

TL;DR: Autoregressive text-to-image synthesis with controllable spatial structure.

Conflicts: google.com, umich.edu

Keywords: Deep learning, Computer vision, Multi-modal learning, Natural language processing
