Now They See It, Now They Don't: Multimodal Reward Models Exhibit Unreliability in Physical World Constraints

Published: 18 May 2026, Last Modified: 18 May 2026 | CoNLL 2026 Archival | CC BY 4.0
Keywords: spatial reasoning, text-to-image models, multimodal reward models
TL;DR: We show that humans consider reasoning errors about spatial orientation to be particularly impactful on image quality and that multimodal reward models exhibit unreliability when evaluating spatial orientation errors.
Abstract: Generative AI systems, especially those driven by autoregressive and diffusion-based models, are known to struggle with spatial reasoning. It is therefore critical to understand how humans regard those failure modes. In this paper, we examine how humans judge different types of errors in images generated by a text-to-image model. We curated prompts that described common household objects varying in number, spatial relations, and orientations, and generated a variety of images using each prompt. Human annotators viewed pairs of images generated from the same prompt and answered a systematic set of questions about each image. Survey results showed that incorrect spatial *orientation* regularly emerges as a reason that the generated images do not accurately represent the prompt. We further investigated how RLHF-based multimodal reward models score prompt-image alignment over the same data, and whether they can reliably distinguish the better image in a pairwise setting, as humans do. We find that even though a general cross-task reward model may output alignment scores that accord with those of humans, its reasoning traces are flawed with respect to spatial orientational and relational indicators, the very factors that human annotators rated as the most consequential errors in generated images. Our results show that human annotators regard spatial reasoning errors as highly impactful on the correctness of generated images, and they call into question the reliability of multimodal reward model scores as a baseline for evaluating image quality.
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Multimodality and Grounding
Secondary Area Selection: Computational Psycholinguistics, Cognition and Linguistics
Use Of Generative Artificial Intelligence Tools: Yes, other (specify below)
Other Use Of Generative Artificial Intelligence Tools: Figure 15 in the appendix was created with the assistance of Nano Banana to compose the images and the text.
Data Collection From Human Subjects: Yes, with details included in the main paper or in an appendix on (1) how the data was obtained, (2) how participants were recruited and paid, (3) how consent was obtained, and (4) whether an IRB protocol was approved for this study. Note that providing this information is obligatory.
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 188