TP‑Blend: Textual‑Prompt Attention Pairing for Precise Object‑Style Blending in Diffusion Models

TMLR Paper4811 Authors

09 May 2025 (modified: 23 May 2025) | Under review for TMLR | CC BY 4.0
Abstract: Current text‑conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin‑Prompt Attention Blend (TP‑Blend), a lightweight training‑free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP‑Blend is driven by two complementary attention processors. Cross‑Attention Object Fusion (CAOF) first averages head‑wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy‑regularised optimal transport problem that reassigns complete multi‑head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD‑XL), preserving rich cross‑head correlations while keeping memory low. Self‑Attention Style Fusion (SASF) injects style at every self‑attention layer through Detail‑Sensitive Instance Normalization. A lightweight one‑dimensional Gaussian filter separates low‑ and high‑frequency components; only the high‑frequency residual is blended back, imprinting brush‑stroke‑level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context‑aware texture modulation that remains independent of object fusion. Extensive experiments show that TP‑Blend produces high‑resolution, photo‑realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
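The two numerical ingredients named in the abstract, entropy‑regularised optimal transport and a one‑dimensional Gaussian low/high‑frequency split, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the uniform marginals, the squared‑Euclidean cost, the barycentric projection used to move features, and all parameter values (`eps`, `sigma`, `radius`, the toy shapes) are assumptions chosen only for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropy-regularised OT plan via Sinkhorn iterations.
    Assumption: uniform marginals over rows and columns."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)            # row marginal (target tokens)
    b = np.full(m, 1.0 / m)            # column marginal (source tokens)
    K = np.exp(-cost / eps)            # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                # rescale rows
        v = b / (K.T @ u)              # rescale columns
    return u[:, None] * K * v[None, :]

def gaussian_kernel_1d(sigma=1.0, radius=3):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def split_frequencies(signal, sigma=1.0, radius=3):
    """Return (low, high): Gaussian-smoothed signal and its residual."""
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(signal, radius, mode="reflect")
    low = np.convolve(padded, k, mode="valid")   # same length as signal
    return low, signal - low

# Toy data (hypothetical shapes): 5 selected spatial tokens, 6 blend-object
# tokens, each carrying an 8-dimensional full feature vector.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 8))
source = rng.normal(size=(6, 8))

# Assumed cost: squared Euclidean distance, normalised to [0, 1].
cost = ((target[:, None, :] - source[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()
plan = sinkhorn_plan(cost)

# Assumed reassignment rule: barycentric projection of source features.
moved = (plan / plan.sum(axis=1, keepdims=True)) @ source

# Frequency split of a noisy 1-D signal; only `high` would be blended back.
signal = np.sin(np.linspace(0.0, 6.0, 64)) + 0.1 * rng.normal(size=64)
low, high = split_frequencies(signal)
```

In this sketch each row of `moved` is a weighted average of complete source feature vectors, so the transport acts on whole vectors rather than on individual heads, and `low + high` reconstructs `signal` exactly, so adding back only a scaled `high` would change fine detail while leaving the smoothed structure untouched.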
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Nicolas_THOME2
Submission Number: 4811