Keywords: image generation, generative AI, benchmark, Stable Diffusion, Midjourney, DALL-E, cognition
TL;DR: We introduce a benchmark for generative image models in the form of an open-source software package and use it to assess the performance of popular algorithms on 11 cognitive tasks.
Abstract: Recent developments in generative AI show that the field is moving toward models that acquire a broad set of skills and can solve a multitude of tasks without having been explicitly optimized for them. Benchmarks have been a key driver of model development in the past, but as models’ capabilities become more complex, it becomes harder to create benchmarks that meaningfully capture the skill sets of algorithms and inform future model development and training pipelines. While a wide-ranging array of benchmarks has been developed in the language domain, comparable benchmarks are currently missing for generative image algorithms. Here we introduce the Generative Image Model Benchmark for Reasoning and Representation (GIMBRR), an open-source software package for assessing generative image algorithms on 11 cognitive tasks using manual and automated evaluation pipelines. GIMBRR is built with customizability in mind, so that it can easily be updated with new tasks and assessment routines. It can thus be adapted to the needs of research teams with specific goals in image generation, and task difficulty can be adjusted as generative image algorithms progress. We used GIMBRR to measure the performance of three popular generative image models (DALL-E 2, Midjourney, Stable Diffusion) and found that reasoning and representation tasks pose a considerable challenge to all of them. We also demonstrate how cognitive theory can be used to systematically analyze the generative and representational capabilities of these models.