Readers: Everyone (since 19 Mar 2024). License: CC BY 4.0
This paper addresses the need for more accurate evaluation methods in text-to-image synthesis. While the standard CLIPScore metric reflects text-image alignment to some extent, it is often inconsistent with human perception. We propose using GPT-4 Vision as a novel evaluative standard, since it can interpret nuances of text and images in a way akin to human cognition. Our study focuses on the role of prompt design in maximizing GPT-4 Vision's effectiveness and presents a systematic discussion of prompt construction. Empirical evaluations demonstrate that GPT-4 Vision, augmented by our prompt-design strategy, aligns more closely with human judgment.