Abstract: The unconditional generation of high fidelity images is a longstanding benchmark
for testing the performance of image decoders. Autoregressive image models
have been able to generate small images unconditionally, but the extension of
these methods to large images where fidelity can be more readily assessed has
remained an open problem. Among the major challenges are the capacity to encode
the vast previous context and the sheer difficulty of learning a distribution that
preserves both global semantic coherence and exactness of detail. To address the
former challenge, we propose the Subscale Pixel Network (SPN), a conditional
decoder architecture that generates an image as a sequence of image slices of equal
size. The SPN compactly captures image-wide spatial dependencies and requires a
fraction of the memory and the computation. To address the latter challenge, we
propose to use multidimensional upscaling to grow an image in both size and depth
via intermediate stages corresponding to distinct SPNs. We evaluate SPNs on the
unconditional generation of CelebAHQ of size 256 and of ImageNet from size 32
to 128. We achieve state-of-the-art likelihood results in multiple settings, set up
new benchmark results in previously unexplored settings and are able to generate
very high fidelity large scale samples on the basis of both datasets.
TL;DR: We show that autoregressive models can generate high fidelity images.
Data: [CelebA](https://paperswithcode.com/dataset/celeba), [CelebA-HQ](https://paperswithcode.com/dataset/celeba-hq)
16 Replies
Loading