FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Diffusion models, Controlled Image Generation
TL;DR: We present FreeControl, a training-free method for structural control in diffusion models that achieves strong semantic alignment with less than 2\% additional inference cost.
Abstract: Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present \textbf{FreeControl}, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs \textit{one-step attention extraction} from a single, optimally chosen timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce \textit{Latent-Condition Decoupling (LCD)}: a principled separation of the timestep condition from the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control: structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility of intuitive compositional design and compatibility with modern diffusion models, at ~5\% additional cost.
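To make the two mechanisms in the abstract concrete, here is a minimal sketch of one-step attention extraction with LCD, assuming a diffusers-style UNet and scheduler. The function name `extract_reference_attention`, the parameters `t_star` and `t_cond`, and their default values are illustrative assumptions, not the paper's API; for brevity the hooks record self-attention block outputs, whereas a faithful implementation would capture attention maps via custom attention processors.

```python
import torch

@torch.no_grad()
def extract_reference_attention(unet, scheduler, ref_latent, prompt_emb,
                                t_star=601, t_cond=301):
    """Hedged sketch: noise a reference latent once, run a single UNet pass,
    and record self-attention activations to reuse as structural guidance.
    `t_star` and `t_cond` are illustrative timestep choices, not the paper's."""
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name] = output.detach()
        return hook

    # Register hooks on self-attention blocks (named `attn1` in diffusers UNets).
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in unet.named_modules() if n.endswith("attn1")]

    # One-step extraction: noise the reference latent to a single timestep t_star.
    noise = torch.randn_like(ref_latent)
    t = torch.tensor([t_star], device=ref_latent.device)
    noised = scheduler.add_noise(ref_latent, noise, t)

    # LCD: condition the UNet on a *different* timestep t_cond, decoupling the
    # timestep embedding from the noise level of the latent.
    unet(noised, t_cond, encoder_hidden_states=prompt_emb)

    for h in handles:
        h.remove()
    return captured
```

The captured activations would then be injected at every step of the main denoising loop (e.g., via custom attention processors), which is what lets a single extraction pass replace per-timestep extraction and keeps the overhead to a small fraction of inference cost.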
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 9405