Keywords: cycle consistency, multimodal learning, vision-language modeling, text-to-image generation, synthetic data
Abstract: The increasing exchange of images and text in large multimodal models leads us to ask: to what degree are mappings from text to image, and back, cycle-consistent? We first find that current image-to-text models, paired with text-to-image models, achieve a degree of perceptual cycle consistency, even when these models are not trained to have this effect. However, these mappings are far from perfect, motivating us to analyze the ways in which they fail. First, we observe a strong correlation between cycle consistency and downstream performance in both image captioning and text-to-image generation. Next, we investigate how divergent text-to-image mappings are as a function of the number of objects described in the text, and how this affects cycle consistency. Surprisingly, we find that more descriptive text leads to a broader distribution of generated images, but also results in overall better reconstructions. Finally, we show possible challenges of training cycle-consistent models due to the sensitivity of text-to-image models.
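As a rough illustration of the image-to-text-to-image round trip described above, the sketch below composes an off-the-shelf captioning model with a text-to-image model and scores the reconstruction with CLIP image-embedding similarity. The specific checkpoints (BLIP, Stable Diffusion v1.5, CLIP ViT-B/32) and the similarity metric are assumptions for this example only, not the paper's actual experimental setup.

```python
# Illustrative sketch (not the submission's implementation): measuring perceptual
# cycle consistency of an image -> caption -> regenerated-image round trip with
# off-the-shelf models. Model checkpoints and the CLIP-based score are assumptions.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel,
)
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image -> text (captioning)
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

# Text -> image (generation)
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# Perceptual similarity via CLIP image embeddings
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

def caption(image: Image.Image) -> str:
    # Generate a caption for the input image.
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def clip_image_similarity(a: Image.Image, b: Image.Image) -> float:
    # Cosine similarity between CLIP embeddings of two images.
    inputs = clip_proc(images=[a, b], return_tensors="pt").to(device)
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float((emb[0] @ emb[1]).item())

def cycle_consistency(image: Image.Image) -> float:
    # Image -> caption -> regenerated image; score perceptual agreement.
    text = caption(image)
    regenerated = sd(text).images[0]
    return clip_image_similarity(image, regenerated)

# Example usage:
# score = cycle_consistency(Image.open("example.jpg").convert("RGB"))
```

The analogous text-to-image-to-text direction can be scored the same way by comparing the original and reconstructed captions (e.g., with CLIP text embeddings), which is where the abstract's observations about text descriptiveness apply.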
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8962