TL;DR: Efficient visual generation using a decode-time model scaling schedule and cached computation, achieving competitive performance with almost $3\times$ less compute.
Abstract: Recent advances in visual generation have produced content of exceptional quality. However, most methods share a fundamental bottleneck: the computational cost of inference. Most of these algorithms make multiple passes over a transformer model to generate tokens or denoise inputs, yet the model size is kept constant across all iterations, which makes generation computationally expensive. In this work, we address this issue through two key ideas: (a) not all parts of the generation process need equal compute, so we design a decode-time model scaling schedule to use compute effectively, and (b) some of the intermediate computation can be cached and reused. Combining these two ideas, smaller models process more tokens while larger models process fewer tokens. These different-sized models share parameters, so they do not increase the total parameter count. We experiment on ImageNet $256\times256$, UCF101, and Kinetics600 to demonstrate the efficacy of the proposed method for image generation, video generation, and frame prediction. Our experiments show that, with almost $3\times$ less compute than the baseline, our model obtains competitive performance.
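As an illustration of how different-sized models can share parameters, the following is a minimal sketch and not the paper's implementation. It assumes a nested feed-forward block in which a smaller model uses only the first slice of the full model's hidden units; the names `nested_mlp`, `mlp_flops`, and `fraction`, and all sizes, are hypothetical.

```python
import numpy as np

def nested_mlp(x, w_in, w_out, fraction):
    """Feed-forward block using only the first `fraction` of hidden units.

    Assumption: the sub-model is a slice of the full weights, so no extra
    parameters are stored for the smaller model.
    """
    hidden_full = w_in.shape[1]
    h = max(1, int(round(hidden_full * fraction)))
    hidden = np.maximum(x @ w_in[:, :h], 0.0)  # ReLU over the active slice
    return hidden @ w_out[:h, :]

def mlp_flops(num_tokens, d_model, hidden_full, fraction):
    """Approximate FLOPs: two matmuls, 2 FLOPs per multiply-add."""
    h = max(1, int(round(hidden_full * fraction)))
    return 2 * num_tokens * d_model * h * 2

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, hidden_full = 8, 32
w_in = rng.standard_normal((d_model, hidden_full))
w_out = rng.standard_normal((hidden_full, d_model))
x = rng.standard_normal((16, d_model))  # 16 tokens

small = nested_mlp(x, w_in, w_out, 0.25)  # quarter-width sub-model
full = nested_mlp(x, w_in, w_out, 1.0)    # full model, same weights
```

Under these assumptions, both calls read from the same `w_in` and `w_out`, and the feed-forward cost grows linearly with the active hidden width, which is why running a smaller slice on most tokens reduces compute.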
Lay Summary: Recently, creating realistic images or videos using computers has seen amazing progress. But there's a big hurdle: these systems are often slow and require a lot of computational power, especially when they're actually creating the content (what we call "inference"). Typically, they follow a multi-step approach, starting with rough outlines and gradually adding fine details. However, they use the same amount of computing power at every step, even when it's not needed. In our work, we show that by adjusting the computational power based on how complex each step is, we can speed up the process by up to $3\times$ without sacrificing quality. We also reuse relevant computations from past steps to reduce the computational power required in the current generation step. To demonstrate the efficacy of our approach, we evaluate it on publicly available datasets: ImageNet for image generation, and UCF101 and Kinetics600 for video generation.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: generation, efficiency, nested models, kv caching, model scaling
Submission Number: 4362