Keywords: auto-regressive models, diffusion models, condition capturing
Abstract: Autoregressive (AR) diffusion models have recently attracted significant attention for their ability to generate high-quality, diverse samples across various tasks involving text, image, and video generation. Despite this surge of interest, the theoretical underpinnings of AR diffusion remain largely unexplored.
This work, for the first time, investigates the inference complexity and underlying mechanisms behind AR diffusion's strong performance. Building on the sequential patch-by-patch generation paradigm, we formalize the inference process as a series of stage-wise conditional distribution samplings. This formulation shows that when each conditional component is learned accurately, the resulting approximation to the full joint distribution is also accurate. Our theoretical analysis establishes an inference complexity bound for AR diffusion with a general number of stages $K$, requiring only minimal smoothness assumptions on the score functions and their estimation error.
The bound includes an additional factor proportional to the number of stages, reflecting the model's sequential architecture. On the other hand, we show that this stage-wise design can be advantageous for learning specific conditional dependencies between patches, which may be overlooked by conventional diffusion models that target only the joint distribution. Experiments on synthetic data validate this theoretical insight.
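The stage-wise factorization described above can be illustrated with a minimal sketch. The code below is purely hypothetical and not the paper's implementation: it samples $K$ patches autoregressively, so the joint distribution factorizes as $p(x_{1:K}) = \prod_k p(x_k \mid x_{<k})$; the `toy_conditional` stands in for a learned per-stage conditional sampler (in AR diffusion, each stage would run a reverse diffusion process conditioned on the previously generated patches).

```python
import numpy as np

def sample_ar_stagewise(K, cond_sample, rng):
    """Sample K patches autoregressively: x_k ~ p(x_k | x_1, ..., x_{k-1}).

    `cond_sample(prefix, rng)` is any per-stage conditional sampler; in
    AR diffusion it would be a conditional reverse-diffusion process.
    """
    patches = []
    for _ in range(K):
        patches.append(cond_sample(patches, rng))
    return patches

def toy_conditional(prefix, rng, rho=0.8):
    # Hypothetical stand-in: a scalar Gaussian patch whose mean depends
    # on the previous patch (an AR(1)-style inter-patch dependency that
    # a joint model would have to capture implicitly).
    mean = rho * prefix[-1] if prefix else 0.0
    return mean + rng.standard_normal()

rng = np.random.default_rng(0)
xs = sample_ar_stagewise(K=4, cond_sample=toy_conditional, rng=rng)
print(len(xs))  # 4 patches, generated one stage at a time
```

Because each stage conditions explicitly on the generated prefix, inter-patch dependencies are represented directly rather than implicitly through a single joint score, which is the mechanism the abstract argues can be advantageous.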
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 16971