Keywords: Image generative models, world models
Abstract: Building world models that accurately and comprehensively represent the real
world is a holy grail for image generative models as it would enable their use as
world simulators. For conditional image generative models to be successful world
models, they should not only excel at image quality and prompt-image consistency
but also ensure high representation diversity. However, current research in
generative models mostly focuses on creative applications that are predominantly
concerned with human preferences of image quality and aesthetics. We note that
generative models have inference-time mechanisms – or knobs – that allow the
control of generation consistency, quality, and diversity. In this paper, we use
state-of-the-art text-to-image models and their knobs to draw consistency-diversity-realism
Pareto fronts that provide a holistic view of the consistency-diversity-realism
multi-objective trade-off. Our experiments suggest that realism and consistency can both be
improved simultaneously; however, there exists a clear tradeoff between
realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier
models are better at representation diversity and worse at consistency-realism, while
more recent models excel in consistency-realism but significantly reduce
representation diversity. Overall, our analysis clearly shows that there is no single
best model and the choice of model should be determined by the downstream
application. With this analysis, we invite the research community to consider
Pareto fronts as an analytical tool to measure progress towards world models.
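
As a rough illustration of what extracting a Pareto front over consistency, diversity, and realism could look like, the sketch below keeps only the configurations that no other configuration dominates. It is a minimal sketch under stated assumptions, not the paper's procedure: all three scores are assumed to be oriented so that higher is better, the configuration names are hypothetical, and the numbers are made up for illustration rather than taken from any experiment.

```python
from typing import Dict, List, Tuple

# (consistency, diversity, realism); ASSUMPTION: higher is better on every axis.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(configs: Dict[str, Scores]) -> List[str]:
    """Return the names of configurations that no other configuration dominates."""
    return [
        name
        for name, scores in configs.items()
        if not any(dominates(other, scores) for other in configs.values())
    ]


# HYPOTHETICAL model/knob settings with made-up scores, for illustration only.
example_configs: Dict[str, Scores] = {
    "model_a_low_guidance": (0.60, 0.90, 0.55),
    "model_a_high_guidance": (0.75, 0.70, 0.70),
    "model_b_low_guidance": (0.80, 0.50, 0.80),
    "model_b_high_guidance": (0.90, 0.30, 0.90),
    "model_c_default": (0.70, 0.65, 0.65),  # dominated by model_a_high_guidance
}

front = pareto_front(example_configs)
print(front)
```

In this toy data no single configuration wins on all three axes, so several settings remain on the front, which mirrors the point that the choice of model depends on the downstream application.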
Submission Number: 41