Abstract: Highlights
• Introduces DiffBlender to unify multiple input modalities—structure, layout, and attribute—within a single T2I framework.
• Utilizes a compact “Blender block” that preserves the pre-trained diffusion parameters, minimizing additional training overhead.
• Enables efficient multimodal generation and composability across diverse conditions and user preferences.
• Proposes mode-specific guidance for precise control over each modality, ensuring balanced and high-fidelity image synthesis.
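As a rough illustration of the two mechanisms named in the highlights, the sketch below pairs a zero-initialized adapter on top of a frozen stand-in backbone with a per-modality guidance combiner. All names (`BlenderBlock`, `mode_specific_guidance`), shapes, and the guidance formula are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class BlenderBlock(nn.Module):
    """Lightweight adapter that fuses per-modality condition embeddings
    into a gated residual on a frozen backbone feature.
    (Hypothetical sketch; not the paper's actual architecture.)"""
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_modalities)
        )
        # Zero-initialized gates: the block starts as an identity mapping,
        # so the pre-trained backbone's behavior is preserved at step 0.
        self.gate = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, h, conds):
        for g, proj, c in zip(self.gate, self.proj, conds):
            h = h + torch.tanh(g) * proj(c)  # gated residual per modality
        return h

def mode_specific_guidance(eps_uncond, eps_conds, weights):
    """Per-modality guidance, modeled on classifier-free guidance:
    each modality m contributes with its own scale w_m (assumption)."""
    out = eps_uncond.clone()
    for eps_c, w in zip(eps_conds, weights):
        out = out + w * (eps_c - eps_uncond)
    return out

# Freeze a stand-in pre-trained layer; only the BlenderBlock is trainable.
backbone = nn.Linear(64, 64)
for p in backbone.parameters():
    p.requires_grad = False

blender = BlenderBlock(dim=64, num_modalities=3)  # structure / layout / attribute
x = torch.randn(2, 64)
conds = [torch.randn(2, 64) for _ in range(3)]
out = blender(backbone(x), conds)  # gradients flow only into the adapter
```

Zero-initializing the gates is a common adapter trick for keeping the pre-trained output unchanged at the start of training, which is one plausible way to "preserve the pre-trained diffusion parameters" as the highlights describe.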
External IDs: dblp:journals/eswa/KimLHKA26