Probing into the Fine-grained Manifestation in Multi-modal Image Synthesis

Published: 01 Feb 2023, Last Modified: 13 Feb 2023 | Submitted to ICLR 2023 | Readers: Everyone
Keywords: Multi-modal image synthesis, semantic consistency measurement, robustness testing
Abstract: The rapid development of multi-modal image synthesis has brought unprecedented realism to generation tasks. In practice, it is straightforward to judge the visual quality and realism of an image. However, it is labor-intensive to verify the semantic consistency of a generated image, which requires a comprehensive understanding of, and mapping between, different modalities. Existing models rank and display their results largely based on global visual-text similarity, but this coarse-grained approach fails to capture the fine-grained semantic alignment between image regions and text spans. To address this issue, we first present a new method to evaluate cross-modal consistency by inspecting decomposed semantic concepts. We then introduce a new metric, the MIS-Score, designed to quantitatively measure the fine-grained semantic alignment between a prompt and its generated image. Moreover, we develop an automated robustness testing technique based on referential transforms to measure the robustness of multi-modal synthesis models. We conduct comprehensive experiments to evaluate the performance of recent popular text-to-image generation models. Our study demonstrates that the proposed MIS-Score provides a better evaluation criterion for the semantic consistency of synthesized results than existing coarse-grained metrics (e.g., CLIP). Our robustness testing method also reveals biases embedded in these models, uncovering their limitations in real applications.
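
The abstract describes the approach only at a high level; the sketch below is a rough, illustrative take on the fine-grained consistency idea, not the paper's actual MIS-Score. It decomposes a prompt into concept spans, scores each span against the generated image with a pretrained CLIP model, and averages the per-concept scores, plus a simple referential-transform probe for robustness. The decomposition via noun chunks, the averaging aggregation, the model checkpoint, and all names (decompose_prompt, fine_grained_score, referential_transform, generated.png) are assumptions made for illustration.

```python
# A minimal sketch (NOT the paper's implementation) of fine-grained
# prompt-to-image consistency scoring via decomposed semantic concepts.
# Assumptions: concepts are extracted as spaCy noun chunks, each concept is
# scored against the whole image with CLIP, and per-concept scores are
# averaged. The real MIS-Score may differ in decomposition, matching, and
# aggregation.
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def decompose_prompt(prompt: str) -> list[str]:
    """Split a prompt into coarse semantic concepts (here: noun chunks)."""
    doc = nlp(prompt)
    chunks = [chunk.text for chunk in doc.noun_chunks]
    return chunks or [prompt]


@torch.no_grad()
def fine_grained_score(prompt: str, image: Image.Image) -> float:
    """Average per-concept image-text similarity (a stand-in for MIS-Score)."""
    concepts = decompose_prompt(prompt)
    inputs = processor(text=concepts, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (1, num_concepts)
    return logits.squeeze(0).mean().item()


def referential_transform(prompt: str, swap: tuple[str, str]) -> str:
    """Hypothetical referential transform: swap two attribute words."""
    a, b = swap
    return prompt.replace(a, "\0").replace(b, a).replace("\0", b)


if __name__ == "__main__":
    img = Image.open("generated.png")  # a synthesized image (placeholder path)
    p = "a red cube on a blue sphere"
    print("original   :", fine_grained_score(p, img))
    print("transformed:", fine_grained_score(
        referential_transform(p, ("red", "blue")), img))
```

A robust model (and a faithful metric) would be expected to assign a noticeably lower consistency score after the referential transform if the image still depicts the original prompt; the paper's own transforms and scoring are likely more sophisticated than this sketch.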
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
TL;DR: A new method for evaluating the semantic consistency and robustness of multi-modal image synthesis models
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)