Abstract: Effective text-to-image generation must synthesize images that are both realistic in appearance (sample fidelity) and sufficiently varied (sample diversity). Diffusion models have achieved promising results in generating high-fidelity images based on textual prompts, and recently, several diversity-focused works have been proposed to improve their demographic diversity by enforcing the generation of samples from various demographic groups. However, another essential aspect of diversity, sample diversity---which allows a prompt to be reused to generate creative samples that reflect real-world variability---has been largely overlooked. Specifically, how to generate images with sufficient demographic and sample diversity while preserving sample fidelity remains an open problem, because in existing works increasing diversity comes at the cost of reduced fidelity. To address this problem, we first propose a bimodal low-rank adaptation of pretrained diffusion models, which decouples the text and image conditioning, and then propose a lightweight bimodal guidance method that introduces additional diversity into the generation process by separately controlling the strength of text and image conditioning, using reference images retrieved through a fairness strategy. We conduct extensive experiments to demonstrate the effectiveness of our method in enhancing demographic diversity (Intersectional Diversity~\citep{FairRAG}) by 2.47× and sample diversity (Recall~\citep{precision_recall}) by 1.45× while preserving sample fidelity (Precision~\citep{precision_recall}) compared to the baseline diffusion model.
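The abstract describes separately controlling the strength of text and image conditioning during sampling. Below is a minimal sketch of one way such bimodal guidance could be combined, assuming a classifier-free-guidance-style linear combination of noise predictions; the function name, argument names, and default scales are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def bimodal_guidance(eps_uncond, eps_text, eps_image, s_text=6.0, s_image=3.0):
    """Hypothetical combination of noise predictions with separate
    text and image (reference) guidance scales, in the spirit of
    classifier-free guidance. The paper's exact formula may differ."""
    return (
        eps_uncond
        + s_text * (eps_text - eps_uncond)    # strength of text conditioning
        + s_image * (eps_image - eps_uncond)  # strength of image conditioning
    )

# Usage with dummy tensors shaped like latent noise predictions:
shape = (1, 4, 64, 64)
eps_u, eps_t, eps_i = (torch.randn(shape) for _ in range(3))
eps = bimodal_guidance(eps_u, eps_t, eps_i)
```

Raising `s_image` would, under this assumed formulation, pull generations toward the retrieved reference images (adding diversity), while `s_text` preserves prompt adherence.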
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Chen_Sun1
Submission Number: 4939