Customized Multi-Subject Text-to-Image Generation with Causal Tuning

Chaoyang Li, Xin Wang, Wenwu Zhu, Lingzhi Wang, Ning Hu, Qing Liao

Published: 01 Jan 2025, Last Modified: 14 Mar 2026, IEEE Transactions on Circuits and Systems for Video Technology, CC BY-SA 4.0

Abstract: Subject-driven text-to-image generation, which aims to generate customized, high-fidelity images of specific subjects from text descriptions, has gained increasing attention. Despite recent advances in single-subject customization, existing methods often struggle in multi-subject scenarios, where subject identities become distorted. This challenge arises for two reasons: entangled identity-relevant and identity-irrelevant information can obscure subject identities, and inter-subject interference can cause individual identities to be confused or lost. To address these issues, we propose CausalT2I, a customized multi-subject text-to-image generation framework with causal tuning. First, we propose a subject-aware causal disentanglement method that self-adaptively distinguishes causally relevant from causally irrelevant information for each subject through causal intervention and a causal disentangled objective. Then, we design a soft cross-attention guidance strategy that mitigates interference among subjects by aligning the textual attributes of each subject with its identity-relevant visual attributes. Finally, we introduce a causal denoising objective that optimizes the denoising process using identity-preserved textual embeddings and identity-irrelevant visual embeddings. Extensive experiments show that CausalT2I outperforms existing baseline methods in subject-driven text-to-image generation and offers more flexibility and controllability when generating customized multi-subject images.

External IDs: doi:10.1109/tcsvt.2025.3645812