Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, Noam Shazeer, Lukasz Kaiser
Feb 15, 2018 (modified: Feb 15, 2018). ICLR 2018 Conference Blind Submission. Readers: everyone.
Abstract: Image generation has been successfully cast as an autoregressive sequence generation
or transformation problem. Recent work has shown that self-attention is
an effective way of modeling textual sequences. In this work, we generalize a
recently proposed model architecture based on self-attention, the Transformer, to
a sequence modeling formulation of image generation with a tractable likelihood.
By restricting the self-attention mechanism to attend to local neighborhoods, we
significantly increase the size of images the model can process in practice, while
maintaining significantly larger receptive fields per layer than typical convolutional
neural networks. We propose another extension of self-attention allowing it
to efficiently take advantage of the two-dimensional nature of images.
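The local restriction described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single head, uses the input directly as queries, keys, and values (no learned projections), and treats the image as a flattened 1D sequence. The function name `local_causal_attention` and the `window` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def local_causal_attention(x, window):
    """Single-head self-attention over a sequence x of shape (n, d).

    Assumption for illustration: position i attends only to positions
    max(0, i - window + 1) .. i, i.e. itself and the preceding
    window - 1 positions.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)            # (n, n) scaled dot-product scores
    idx = np.arange(n)
    offset = idx[:, None] - idx[None, :]     # query index minus key index
    allowed = (offset >= 0) & (offset < window)
    scores = np.where(allowed, scores, -np.inf)   # mask out everything else
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# Usage: 6 positions, 4 features each, neighborhood of 3 positions.
x = np.random.default_rng(0).normal(size=(6, 4))
out, w = local_causal_attention(x, window=3)
```

With a fixed `window`, each position has at most `window` nonzero attention weights, which is what keeps the per-position cost independent of image size. This sketch still builds the full score matrix for clarity; a practical implementation would compute only the scores inside each neighborhood.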
While conceptually simple, our generative models are competitive with or
significantly outperform the current state of the art in autoregressive
image generation on two different data sets, CIFAR-10 and ImageNet.
We also present results on image super-resolution with a large magnification ratio,
applying an encoder-decoder configuration of our architecture. In a human
evaluation study, we show that our super-resolution models improve significantly
over previously published autoregressive super-resolution models. Images they
generate fool human observers three times more often than the previous state of