MindScore: quantifying human preference for text-to-image generation through multi-view lens

Published: 2025 · Last Modified: 15 Jan 2026 · Sci. China Inf. Sci. 2025 · CC BY-SA 4.0
Abstract: Understanding and quantifying the capabilities of foundation models, particularly in text-to-image (T2I) generation, is crucial for verifying their alignment with human expectations and practical requirements. However, evaluating T2I foundation models presents significant challenges due to the complex, multi-dimensional psychological factors that influence human preferences for generated images. In this work, we propose MindScore, a multi-view framework for assessing the generation capacity of T2I models through the lens of human preference. Specifically, MindScore decomposes the evaluation into four complementary modules that align with human cognitive processing of images: matching, faithfulness, quality, and realness. The matching module quantifies the semantic alignment between generated images and prompt text, while the faithfulness module measures how accurately the images reflect specific prompt details. Furthermore, we incorporate quality and realness modules to capture deeper psychological preferences, recognizing that unpleasant or distorted images often trigger adverse human responses. Extensive experiments on three T2I datasets with human preference annotations clearly validate the superiority of our proposed MindScore over various state-of-the-art baselines. Our case studies further reveal that MindScore offers valuable insights into T2I generation from a human-centric perspective.
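The four-view decomposition described above can be sketched as a simple score-fusion step. This is a hypothetical illustration only: the per-view scoring models, the weights, and the weighted-sum fusion are assumptions for exposition, not the authors' actual MindScore method.

```python
# Hypothetical sketch of a MindScore-style multi-view aggregate.
# Module names follow the abstract; the fusion rule and weights are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class ViewScores:
    matching: float      # semantic alignment between image and prompt
    faithfulness: float  # accuracy on specific prompt details
    quality: float       # perceptual / aesthetic quality
    realness: float      # absence of distortions and artifacts


def mind_score(v: ViewScores,
               weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Fuse the four per-view scores (each in [0, 1]) into one preference score."""
    parts = (v.matching, v.faithfulness, v.quality, v.realness)
    return sum(w * s for w, s in zip(weights, parts))


scores = ViewScores(matching=0.9, faithfulness=0.8,
                    quality=0.7, realness=0.95)
print(round(mind_score(scores), 3))  # → 0.84
```

In practice each view would be produced by its own learned model (e.g. an image-text alignment scorer for matching), and the fusion could itself be learned from human preference annotations rather than fixed weights.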