Image Transformer


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By applying a self-attention mechanism with a limited receptive field multiple times in parallel to different parts of the sequence, we significantly increase the length of sequences the model can process efficiently. We propose another simple extension of self-attention to allow it to take advantage of the two-dimensional nature of images. While conceptually simple, our generative models are competitive with or outperform the current state of the art in autoregressive image generation on two different data sets, CIFAR-10 and ImageNet. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we show that our super-resolution models improve significantly over previously published autoregressive super-resolution models. Images they generate fool human observers three times more often than the previous state of the art. Lastly, we provide examples of images generated or completed by our various models which, following previous work, we also believe to look pretty cool.
  • Keywords: image generation, super-resolution, self-attention, transformer
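
The abstract's idea of self-attention with a limited receptive field can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_causal_attention`, the `block_size` parameter, and the choice to let each position see its own block plus the one immediately before it are all assumptions made for illustration. The sketch treats the image as a flattened 1D sequence and omits learned projections, multiple heads, and the two-dimensional extension.

```python
import math


def local_causal_attention(queries, keys, values, block_size):
    """Block-local causal self-attention over a 1D sequence (illustrative).

    Position i attends only to positions j <= i that lie in its own block
    or in the immediately preceding block. The window layout is an
    assumption of this sketch, not a detail taken from the abstract.
    """
    n = len(queries)
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for i in range(n):
        # Start of the block containing position i.
        block_start = (i // block_size) * block_size
        # Receptive field: previous block plus the causal part of this block.
        window = range(max(0, block_start - block_size), i + 1)
        # Scaled dot-product scores against visible positions only.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in window
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append(
            [sum(w * values[j][c] for w, j in zip(weights, window)) for c in range(dv)]
        )
    return outputs


# Toy sequence of 8 positions with 2 features each.
x = [[float(i), 1.0] for i in range(8)]
out = local_causal_attention(x, x, x, block_size=2)
```

Each query scores at most `2 * block_size` positions, so for a fixed block size the total cost grows linearly with sequence length rather than quadratically. The loop here runs sequentially for readability; because the per-block computations are independent, they could be batched and evaluated in parallel.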