Abstract: ControlNet enables fine-grained control over image layout in prominent generators like Stable Diffusion. However, it cannot take into account localized textual descriptions, which indicate which image region is described by which phrase in the prompt. In this work, we enable ControlNet to use localized descriptions through a training-free approach that modifies the cross-attention scores during generation. To do so, we adapt and investigate several existing cross-attention control methods and identify shortcomings that cause failure or image degradation under certain conditions. To address these shortcomings, we develop a novel cross-attention manipulation method. Qualitative and quantitative experimental studies demonstrate the effectiveness of the proposed augmented ControlNet.
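The abstract does not spell out the specific manipulation used in the paper, but the general family of training-free cross-attention control it refers to can be illustrated with a short sketch. The idea, under common formulations of this technique, is to bias the raw cross-attention logits so that image positions inside a phrase's region attend more strongly to that phrase's tokens, and positions outside it less. All names and the bias scheme below are illustrative assumptions, not the paper's method:

```python
import torch

def apply_localized_attention_bias(attn_scores, token_spans, region_masks, bias=4.0):
    """Bias cross-attention logits toward region/phrase bindings.

    attn_scores:  (batch*heads, image_tokens, text_tokens) raw logits,
                  taken before the softmax in a cross-attention layer.
    token_spans:  list of (start, end) token index ranges, one per phrase.
    region_masks: list of flattened boolean masks of shape (image_tokens,),
                  one per phrase, marking where that phrase applies.
    """
    for (start, end), mask in zip(token_spans, region_masks):
        # Boost attention from image positions inside the region
        # to the phrase's text tokens.
        attn_scores[:, mask, start:end] += bias
        # Suppress attention from positions outside the region.
        attn_scores[:, ~mask, start:end] -= bias
    return attn_scores

if __name__ == "__main__":
    # Toy example: 8x8 latent grid (64 image tokens), 77 text tokens.
    scores = torch.randn(16, 64, 77)
    mask = torch.zeros(64, dtype=torch.bool)
    mask[:32] = True  # phrase applies to the top half of the image
    scores = apply_localized_attention_bias(scores, [(3, 6)], [mask])
```

In a real pipeline this would be hooked into each cross-attention layer of the denoising U-Net (and, for ControlNet, its control branch) before the softmax; the paper's contribution is precisely a refined version of such manipulation that avoids the failure modes of additive-bias schemes like this one.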