Patch-Wise and Keyword-Aware: Efficient Multi-Condition Control of Diffusion Transformers via Position-Aligned and Keyword-Scoped Attention
Keywords: multi-condition control generation, diffusion transformers
Abstract: While modern text-to-image models excel at generation from prompts, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control emerges as a key solution to this limitation. However, its application in Diffusion Transformers (DiTs) is severely hampered by the ``concatenate-and-attend'' strategy, which creates a prohibitive computational and memory bottleneck. Our analysis reveals that this computation is largely redundant. We therefore introduce Patch-wise and Keyword-Aware Attention (PKA), a framework using two specialized modules to eliminate this inefficiency. Position-Aligned Attention (PAA) confines spatial control to aligned patches, while Keyword-Scoped Attention (KSA) restricts subject-driven control to keyword-activated regions. Complemented by an early-timestep sampling strategy that accelerates training, PKA achieves up to a 10$\times$ inference speedup and a 5.12$\times$ reduction in attention-module VRAM, all while maintaining or improving generative quality. Our work offers a practical path towards complex, fine-grained, and resource-friendly AI generation.
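To make the two sparsification ideas concrete, the sketch below builds the boolean attention masks that the abstract describes: PAA lets each image patch attend only to its spatially aligned condition patch, while KSA lets only keyword-activated patches attend to subject tokens. This is a minimal illustrative sketch, not the paper's implementation; the function names, the identity patch alignment, and the `keyword_region` input are all assumptions for illustration.

```python
import torch


def position_aligned_mask(num_patches: int) -> torch.Tensor:
    """Sketch of PAA: each generated-image patch may attend only to the
    condition patch at the same spatial index (assumes a one-to-one
    patch alignment between image and spatial-control tokens)."""
    return torch.eye(num_patches, dtype=torch.bool)


def keyword_scoped_mask(keyword_region: torch.Tensor,
                        num_subject_tokens: int) -> torch.Tensor:
    """Sketch of KSA: only image patches inside the keyword-activated
    region (a boolean vector over patches, assumed given) attend to the
    subject-condition tokens."""
    return keyword_region[:, None].expand(-1, num_subject_tokens)


# Dense "concatenate-and-attend" would score every image-condition pair;
# the sparse masks keep only a small fraction of those entries.
num_patches, num_subject_tokens = 4, 3
paa = position_aligned_mask(num_patches)                  # 4 of 16 entries kept
region = torch.tensor([True, False, True, False])         # hypothetical keyword region
ksa = keyword_scoped_mask(region, num_subject_tokens)     # 6 of 12 entries kept
```

In a DiT attention layer, such masks would be passed as the `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention`, so the disallowed image-condition pairs contribute nothing and the attention cost scales with the number of kept entries rather than the full concatenated sequence length.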
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7568