Keywords: CV: Bias, Fairness & Privacy, CV: Language and Vision, Image Synthesis
TL;DR: We introduce a new task of culturally-aware text-to-image synthesis: given a specific cultural context, we generate visual content that is both accurate and inoffensive to consumers from that culture.
Abstract: It has been shown that accurate representation in media improves the well-being of the people who consume it. By contrast, inaccurate representations can negatively affect viewers and lead to harmful perceptions of other cultures. As artificial intelligence improves at image synthesis and becomes ubiquitous in content creation, special attention will need to be paid to ensuring accurate representation; however, it is well understood that these models absorb the biases of their training data, are Anglo-centric, and can amplify harmful stereotypes. To achieve inclusive representation in generated images, we introduce a new task of culturally-aware text-to-image synthesis: given a specific cultural context, the goal is to generate visual content that is both accurate and inoffensive to consumers from that culture. We then present our proposed approach, Culturally-Aware Stable Diffusion, comprising two priming techniques: (1) fine-tuning a pre-trained text-to-image synthesis model, Stable Diffusion, on a hand-selected, culturally representative image dataset, and (2) augmenting the input prompt with additional culturally relevant language data. The culturally relevant data is curated by people who have a personal relationship with the particular culture, and we recruit participants who are part of that culture to evaluate the method. Our preliminary experiments indicate that priming with both text and images is effective in improving the cultural relevance of generated images.
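The prompt-augmentation half of the approach could be sketched, in simplified and hypothetical form, as follows. The function name, prompt template, and checkpoint path below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of culturally-aware prompt augmentation
# (template and names are assumptions; the abstract does not
# specify the actual prompt format or model checkpoint).

def augment_prompt(prompt: str, culture: str, cultural_terms: list[str]) -> str:
    """Prepend a cultural context and append curated culturally
    relevant descriptors to a base text-to-image prompt."""
    descriptors = ", ".join(cultural_terms)
    return f"In {culture} culture, {prompt}, {descriptors}"

# The augmented prompt would then be passed to a Stable Diffusion
# model fine-tuned on a culturally representative image dataset,
# e.g. via Hugging Face diffusers (checkpoint path is an assumption):
#
#   from diffusers import StableDiffusionPipeline
#   pipe = StableDiffusionPipeline.from_pretrained("path/to/finetuned-sd")
#   image = pipe(augment_prompt("a wedding ceremony", "Korean",
#                               ["hanbok", "traditional hall"])).images[0]

print(augment_prompt("a wedding ceremony", "Korean",
                     ["hanbok", "traditional hall"]))
```

This separates the text-side priming from the image-side priming (the fine-tuned weights), so either technique can be evaluated independently.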
Submission Type: non-archival
Presentation Type: onsite
Presenter: Zhixuan Liu, Peter Schaldenbrand, Youeun Shin, Beverley-Claire Okogwu, Youngsik Yun, Jean Oh