A Human-factors Approach for Evaluating AI-generated Images

Published: 01 Jan 2024 · Last Modified: 03 Feb 2025 · SIGMIS-CPR 2024 · License: CC BY-SA 4.0
Abstract: As generative artificial intelligence (AI) becomes more common in day-to-day life, AI-generated content (AIGC) needs to be accurate, relevant, and comprehensive. These characteristics are typically determined through subjective, human-based image quality assessment; however, there is limited research on the evaluation of AI-generated image quality. Over 9,800 images were generated using Craiyon and OpenAI's DALL-E 2 text-to-image models and evaluated on three criteria proposed for determining the quality of visual AIGC: (1) the number of objects, (2) resolution (strictly image quality, independent of the label/prompt), and (3) representativeness (how well the image matches the label/prompt). We observe that the paid DALL-E 2 model produced a dataset with fewer objects per image, higher resolution, and higher representativeness than the free Craiyon model. There is an inverse relationship between the number of objects in an image and that image's resolution and representativeness. This study establishes three subjective metrics for the evaluation of synthetic images to support the creation of more inclusive AIGC.
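To make the three criteria concrete, the sketch below shows one way a per-image human-rating record and its aggregation might be structured. The class name `ImageRating`, the 1-5 Likert scales, and the `mean_scores` helper are illustrative assumptions for exposition, not the authors' actual rating protocol.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record of one rater's judgment of one image under the
# paper's three criteria. Field names and scales are assumptions.
@dataclass
class ImageRating:
    image_id: str
    model: str               # e.g. "DALL-E 2" or "Craiyon"
    object_count: int        # criterion 1: number of distinct objects
    resolution: int          # criterion 2: image quality, 1-5 Likert (assumed)
    representativeness: int  # criterion 3: prompt-image match, 1-5 Likert (assumed)

def mean_scores(ratings: list[ImageRating]) -> dict[str, float]:
    """Aggregate per-criterion means across a set of human ratings."""
    return {
        "objects": mean(r.object_count for r in ratings),
        "resolution": mean(r.resolution for r in ratings),
        "representativeness": mean(r.representativeness for r in ratings),
    }

if __name__ == "__main__":
    sample = [
        ImageRating("img-001", "DALL-E 2", object_count=2, resolution=5, representativeness=4),
        ImageRating("img-002", "Craiyon", object_count=6, resolution=2, representativeness=2),
    ]
    print(mean_scores(sample))  # per-criterion means over the sample
```

Grouping such records by `model` would reproduce the kind of comparison the abstract reports, e.g. whether DALL-E 2 outputs average fewer objects and higher resolution/representativeness than Craiyon outputs.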