Patch-Wise and Keyword-Aware: Efficient Multi-Condition Control of Diffusion Transformers via Position-Aligned and Keyword-Scoped Attention
Keywords: multi-condition control generation, diffusion transformers
Abstract: While modern text-to-image models excel at generation from prompts, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control emerges as a key solution to this limitation. However, its application in Diffusion Transformers (DiTs) is severely hampered by the ``concatenate-and-attend'' strategy, which creates a prohibitive computational and memory bottleneck. Our analysis reveals that this computation is largely redundant. We therefore introduce Patch-wise and Keyword-Aware Attention (PKA), a framework using two specialized modules to eliminate this inefficiency. Position-Aligned Attention (PAA) confines spatial control to aligned patches, while Keyword-Scoped Attention (KSA) restricts subject-driven control to keyword-activated regions. Complemented by an early-timestep sampling strategy that accelerates training, PKA achieves up to a 10$\times$ inference speedup and a 5.12$\times$ reduction in attention-module VRAM, all while maintaining or improving generative quality. Our work offers a practical path towards complex, fine-grained, and resource-friendly AI generation.
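To make the two sparsification ideas concrete, the sketch below builds the boolean attention masks that the abstract describes: PAA lets each image patch attend only to its spatially aligned condition patch, while KSA lets only keyword-activated patches attend to subject tokens. This is a minimal illustrative sketch, not the paper's implementation; the function names, the identity patch alignment, and the `keyword_region` input are all assumptions for illustration.

```python
import torch


def position_aligned_mask(num_patches: int) -> torch.Tensor:
    """Sketch of PAA: each generated-image patch may attend only to the
    condition patch at the same spatial index (assumes a one-to-one
    patch alignment between image and spatial-control tokens)."""
    return torch.eye(num_patches, dtype=torch.bool)


def keyword_scoped_mask(keyword_region: torch.Tensor,
                        num_subject_tokens: int) -> torch.Tensor:
    """Sketch of KSA: only image patches inside the keyword-activated
    region (a boolean vector over patches, assumed given) attend to the
    subject-condition tokens."""
    return keyword_region[:, None].expand(-1, num_subject_tokens)


# Dense "concatenate-and-attend" would score every image-condition pair;
# the sparse masks keep only a small fraction of those entries.
num_patches, num_subject_tokens = 4, 3
paa = position_aligned_mask(num_patches)                  # 4 of 16 entries kept
region = torch.tensor([True, False, True, False])         # hypothetical keyword region
ksa = keyword_scoped_mask(region, num_subject_tokens)     # 6 of 12 entries kept
```

In a DiT attention layer, such masks would be passed as the `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention`, so the disallowed image-condition pairs contribute nothing and the attention cost scales with the number of kept entries rather than the full concatenated sequence length.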
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7568