Keywords: Computer Vision, Image Generation, Image-to-Image Generation
Abstract: Recent advances in autoregressive (AR) models have demonstrated their potential
to rival diffusion models in image synthesis. However, for complex
spatially-conditioned generation, current AR approaches rely on fine-tuning the pre-trained
model, leading to significant training costs. In this paper, we propose the Efficient
Control Model (ECM), a plug-and-play framework featuring a lightweight control
module that introduces control signals via a distributed architecture. This
architecture consists of context-aware attention layers that refine conditional features
using real-time generated tokens, and a shared gated feed-forward network (FFN)
designed to maximize the utilization of its limited capacity and ensure coherent
control feature learning. Furthermore, recognizing the critical role of early-stage
generation in determining semantic structure, we introduce an early-centric
sampling strategy that prioritizes learning early control sequences. This approach
reduces computational cost by lowering the number of training tokens per iteration,
while complementary temperature scheduling during inference compensates for
the resulting insufficient training of late-stage tokens. Extensive experiments on
scale-based AR models validate that our method achieves high-fidelity and diverse
control over image generation, surpassing existing baselines while significantly
improving both training and inference efficiency.
Primary Area: generative models
Submission Number: 6054