Keywords: autoregressive image generation, multi-conditional control
Abstract: Controlling generative models with multiple, simultaneous conditions is a critical yet challenging frontier. Mainstream diffusion models, despite their success in single-condition synthesis, often exhibit performance degradation and condition conflicts in this setting. We identify the root cause of this limitation as the inherent *parallel* generation process of these models. By applying all conditional constraints globally and concurrently, they create a "tug-of-war" between competing guidance signals, forcing suboptimal compromises. This paper advocates for a paradigm shift to *serial* generation. We posit that autoregressive models, by constructing images token-by-token, can resolve conflicting constraints locally and sequentially, enabling a more harmonious and precise integration of multiple conditions. To realize this paradigm, we introduce **ContextAR**, an autoregressive framework that represents diverse conditions within a unified sequence. It employs a novel Conditional Context-aware Attention mechanism that restricts inter-condition communication, enhancing both compositional flexibility and computational efficiency. Extensive experiments validate our hypothesis: ContextAR significantly outperforms state-of-the-art parallel (diffusion-based) methods in controllability and faithfulness to multiple conditions, without a trade-off in image quality. Our work establishes serial generation as a more powerful and flexible paradigm for the complex task of multi-conditional image synthesis.
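The abstract describes the Conditional Context-aware Attention mechanism only at a high level. As a loose, hypothetical illustration of what "restricting inter-condition communication" in a unified token sequence could look like, the sketch below builds a block attention mask in which each condition's tokens attend only within their own block while image tokens attend causally to all conditions and to previously generated image tokens. The function name, block layout, and causal structure are assumptions for illustration, not the authors' implementation.

```python
import torch

def build_conditional_context_mask(cond_lengths, num_image_tokens):
    """Build a boolean attention mask (True = may attend) over a unified sequence
    laid out as [cond_1 tokens | cond_2 tokens | ... | image tokens].

    Hypothetical sketch: condition tokens attend only within their own condition
    block (no inter-condition communication); image tokens attend to every
    condition token and causally to earlier image tokens.
    """
    total_cond = sum(cond_lengths)
    seq_len = total_cond + num_image_tokens
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Each condition block attends bidirectionally within itself only.
    start = 0
    for length in cond_lengths:
        mask[start:start + length, start:start + length] = True
        start += length

    # Image tokens: full access to all condition tokens...
    mask[total_cond:, :total_cond] = True
    # ...and causal (lower-triangular) attention among image tokens.
    causal = torch.tril(torch.ones(num_image_tokens, num_image_tokens, dtype=torch.bool))
    mask[total_cond:, total_cond:] = causal

    return mask

# Example: two conditions (say, 4 pose tokens and 3 edge-map tokens) followed by 5 image tokens.
mask = build_conditional_context_mask(cond_lengths=[4, 3], num_image_tokens=5)
print(mask.int())
```

One design consequence of such a mask, if it matches the paper's intent, is compositional flexibility: because condition blocks never attend to one another, conditions could in principle be added or dropped at inference time without retraining the cross-condition interactions.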
Primary Area: generative models
Submission Number: 19073