Generating Interpretable Images with Controllable Structure

ICLR 2017 (Invite to Workshop)
Abstract: We demonstrate improved text-to-image synthesis with controllable object locations using an extension of the Pixel Convolutional Neural Network (PixelCNN). In addition to conditioning on text, we show how the model can generate images conditioned on part keypoints and segmentation masks. The character-level text encoder and image generation network are jointly trained end-to-end via maximum likelihood. We establish quantitative baselines in terms of text- and structure-conditional pixel log-likelihood on three datasets: Caltech-UCSD Birds (CUB), MPII Human Pose (MHP), and Common Objects in Context (MS-COCO).
TL;DR: Autoregressive text-to-image synthesis with controllable spatial structure.
Conflicts: google.com, umich.edu
Keywords: Deep learning, Computer vision, Multi-modal learning, Natural language processing
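
The abstract describes a conditional PixelCNN: an autoregressive image model whose per-pixel distribution is conditioned on a text embedding and on spatial structure (part-keypoint heatmaps or a segmentation mask), trained end-to-end by maximum likelihood. The sketch below illustrates this setup; it is not the authors' implementation, and all class names, layer counts, and sizes (e.g. `ConditionalPixelCNN`, `struct_channels=16`, `text_dim=128`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution masked so each pixel only sees pixels already generated
    in raster order (type 'A' also hides the current pixel; masking across
    RGB channels is omitted for brevity)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.weight.shape[-2:]
        mask = torch.zeros_like(self.weight)
        mask[..., :kh // 2, :] = 1            # rows above the current pixel
        mask[..., kh // 2, :kw // 2] = 1      # pixels to the left in this row
        if mask_type == "B":
            mask[..., kh // 2, kw // 2] = 1   # the current pixel itself
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding)

class ConditionalPixelCNN(nn.Module):
    """PixelCNN conditioned on a text code (broadcast over all positions)
    and a spatial structure map (keypoint heatmaps or segmentation mask)."""
    def __init__(self, channels=64, layers=6, levels=256,
                 text_dim=128, struct_channels=16):
        super().__init__()
        self.inp = MaskedConv2d("A", 3, channels, 7, padding=3)
        # Conditioning paths are NOT masked: the model is allowed to see the
        # structure and text at the pixel it is currently predicting.
        self.struct_proj = nn.Conv2d(struct_channels, channels, 1)
        self.text_proj = nn.Linear(text_dim, channels)
        self.body = nn.ModuleList(
            MaskedConv2d("B", channels, channels, 3, padding=1)
            for _ in range(layers))
        self.levels = levels
        self.out = nn.Conv2d(channels, 3 * levels, 1)

    def forward(self, image, structure, text_emb):
        h = self.inp(image) + self.struct_proj(structure)
        cond = self.text_proj(text_emb)[:, :, None, None]   # broadcast text
        for conv in self.body:
            h = F.relu(conv(F.relu(h + cond)))
        b, _, ht, wd = image.shape
        # Per-pixel, per-channel softmax logits over discrete intensities.
        return self.out(h).view(b, 3, self.levels, ht, wd)

# Maximum-likelihood training reduces to cross-entropy over the discretised
# pixel intensities, as in the original PixelCNN.
model = ConditionalPixelCNN()
img = torch.rand(2, 3, 32, 32)
struct = torch.rand(2, 16, 32, 32)   # e.g. part-keypoint heatmaps
txt = torch.randn(2, 128)            # output of a character-level text encoder
logits = model(img, struct, txt)
target = (img * 255).long()          # intensities in [0, 255]
loss = F.cross_entropy(logits.flatten(0, 1), target.flatten(0, 1))
loss.backward()
```

In this reading, the character-level text encoder would be trained jointly by backpropagating the same log-likelihood loss into the module that produces `txt`; the keypoint or segmentation conditioning enters through an unmasked projection so the current pixel's structure is always visible.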