Keywords: Text-to-Image Models, Cross-Cultural Evaluation, Fairness, Bias, Social Activities
Abstract: Cultural nuances are best expressed through social interactions, yet current text-to-image (T2I) benchmarks focus largely on object-centric artifacts (e.g., food, landmarks, and attire). In this work, we study the cultural faithfulness of T2I models (i.e., adherence to the target culture) through social activities. To this end, we introduce CULTIVate, a new benchmark of 576 activities across 9 categories (e.g., dancing, greeting, dining) with over 19,000 images from 16 countries. We further propose AHEaD, an explainable framework that measures cultural understanding along four dimensions: cultural Alignment, Hallucination, Exaggeration, and Diversity. Unlike prior work that relies on costly human evaluation or image-text alignment (ITA), AHEaD uses culturally grounded descriptors to provide quantitative, interpretable feedback that enables iterative image refinement. Our analysis shows that ITA metrics correlate poorly with human judgments and that alignment alone is insufficient to capture faithfulness. In contrast, FAITH (combining alignment, hallucination, and exaggeration) achieves 27% higher correlation with human judgments than baselines. Finally, we observe systematic disparities: generated images are consistently more faithful for Global North cultures than for Global South cultures.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7832