Abstract: ControlNet enables fine-grained control over image layout in prominent generators like Stable Diffusion. However, it cannot take into account localized textual descriptions, which indicate which image region is described by which phrase in the prompt. In this work, we enable ControlNet to use localized descriptions through a training-free approach that modifies the cross-attention scores during generation. To do so, we adapt and investigate several existing cross-attention control methods and identify shortcomings that cause failure or image degradation under certain conditions. To address these shortcomings, we develop a novel cross-attention manipulation method. Qualitative and quantitative experimental studies demonstrate the effectiveness of the proposed augmented ControlNet.
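The abstract does not spell out the specific manipulation used in the paper, but the general family of training-free cross-attention control it refers to can be illustrated with a short sketch. The idea, under common formulations of this technique, is to bias the raw cross-attention logits so that image positions inside a phrase's region attend more strongly to that phrase's tokens, and positions outside it less. All names and the bias scheme below are illustrative assumptions, not the paper's method:

```python
import torch

def apply_localized_attention_bias(attn_scores, token_spans, region_masks, bias=4.0):
    """Bias cross-attention logits toward region/phrase bindings.

    attn_scores:  (batch*heads, image_tokens, text_tokens) raw logits,
                  taken before the softmax in a cross-attention layer.
    token_spans:  list of (start, end) token index ranges, one per phrase.
    region_masks: list of flattened boolean masks of shape (image_tokens,),
                  one per phrase, marking where that phrase applies.
    """
    for (start, end), mask in zip(token_spans, region_masks):
        # Boost attention from image positions inside the region
        # to the phrase's text tokens.
        attn_scores[:, mask, start:end] += bias
        # Suppress attention from positions outside the region.
        attn_scores[:, ~mask, start:end] -= bias
    return attn_scores

if __name__ == "__main__":
    # Toy example: 8x8 latent grid (64 image tokens), 77 text tokens.
    scores = torch.randn(16, 64, 77)
    mask = torch.zeros(64, dtype=torch.bool)
    mask[:32] = True  # phrase applies to the top half of the image
    scores = apply_localized_attention_bias(scores, [(3, 6)], [mask])
```

In a real pipeline this would be hooked into each cross-attention layer of the denoising U-Net (and, for ControlNet, its control branch) before the softmax; the paper's contribution is precisely a refined version of such manipulation that avoids the failure modes of additive-bias schemes like this one.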