Characterizing Visual Narrative Freedom under Loose Image–Text Alignment

Published: 05 May 2026, Last Modified: 13 May 2026. 4th ALVR Spotlight. License: CC BY 4.0
Keywords: VLMs, Image–Text Alignment, Text-to-Image Generation, Multimedia, Visual Narrative Freedom
TL;DR: Visual narrative freedom emerges in loosely coupled image–text relationships, allowing text-to-image systems to exploit this looseness and produce images sometimes preferred over authentic news photographs, despite their stylistic disadvantages.
Abstract: Mapping linguistic to visual representations is a central objective in vision-language model (VLM) development. Consequently, current VLM benchmarks and training objectives predominantly optimize for literal, descriptive correspondence between modalities. However, in real-world multimedia contexts, text and images are rarely strictly redundant; they are often loosely coupled, interacting at higher narrative levels. To capture this dynamic, we introduce Visual Narrative Freedom (VNF): the degree to which a generative system can produce multiple plausible visual realizations from an underdetermined textual input. In this paper, we systematically evaluate text-to-image (T2I) generation across a continuum of linguistic constraints, conditioning models on visual descriptions, captions, and full news articles. Using perceptual, structural, and semantic metrics, we first demonstrate that as textual constraints loosen (and VNF increases), T2I models produce images that diverge significantly more from ground-truth reference photographs. Human evaluation further reveals that providing generation models with greater visual narrative freedom significantly increases the likelihood that their outputs are preferred over authentic news photographs, despite the known stylistic limitations of AI imagery. Ultimately, these findings suggest that the creation of persuasive multimodal misinformation is more imminent than evaluations of T2I systems based solely on text descriptions may indicate. This highlights the need for VLM evaluation frameworks that better capture the underdetermined nature of real-world image–text alignment.
Submission Number: 43