Abstract: Effective text-to-image generation must synthesize images that are both realistic in appearance (sample fidelity) and sufficiently varied (sample diversity). Diffusion models have achieved promising results in generating high-fidelity images based on textual prompts, and recently, several diversity-focused works have been proposed to improve their demographic diversity by enforcing the generation of samples from various demographic groups. However, another essential aspect of diversity, sample diversity---which allows a prompt to be reused to generate creative samples that reflect real-world variability---has been largely overlooked. Specifically, how to generate images with sufficient demographic and sample diversity while preserving sample fidelity remains an open problem, because in existing works increasing diversity comes at the cost of reduced fidelity. To address this problem, we first propose a bimodal low-rank adaptation of pretrained diffusion models, which decouples the text and image conditioning, and then propose a lightweight bimodal guidance method that introduces additional diversity into the generation process by separately controlling the strength of text and image conditioning, using reference images retrieved through a fairness strategy. We conduct extensive experiments to demonstrate the effectiveness of our method in enhancing demographic diversity (Intersectional Diversity~\citep{FairRAG}) by 2.47× and sample diversity (Recall~\citep{precision_recall}) by 1.45× while preserving sample fidelity (Precision~\citep{precision_recall}) compared to the baseline diffusion model.
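The abstract describes separately controlling the strength of text and image conditioning during sampling. Below is a minimal sketch of one way such bimodal guidance could be combined, assuming a classifier-free-guidance-style linear combination of noise predictions; the function name, argument names, and default scales are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def bimodal_guidance(eps_uncond, eps_text, eps_image, s_text=6.0, s_image=3.0):
    """Hypothetical combination of noise predictions with separate
    text and image (reference) guidance scales, in the spirit of
    classifier-free guidance. The paper's exact formula may differ."""
    return (
        eps_uncond
        + s_text * (eps_text - eps_uncond)    # strength of text conditioning
        + s_image * (eps_image - eps_uncond)  # strength of image conditioning
    )

# Usage with dummy tensors shaped like latent noise predictions:
shape = (1, 4, 64, 64)
eps_u, eps_t, eps_i = (torch.randn(shape) for _ in range(3))
eps = bimodal_guidance(eps_u, eps_t, eps_i)
```

Raising `s_image` would, under this assumed formulation, pull generations toward the retrieved reference images (adding diversity), while `s_text` preserves prompt adherence.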
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Chen_Sun1
Submission Number: 4939