Keywords: Cross-modal Information Transformation, Generative Models, Optimal Transport
Abstract: Deep generative models, such as vision-language models (VLMs) and diffusion models (DMs), have achieved remarkable success in cross-modality generation tasks. However, the cyclic transformation of text $\rightarrow$ image $\rightarrow$ text often fails to secure an exact match between the original and the reconstructed content. In this work, we address this challenge by using a deterministic function to guide the reconstruction of precise information via generative models. Using a color histogram as guidance, we first identify a soft prompt to generate the desired text with a language model and map the soft prompt to a target histogram. We then use the target color histogram as a constraint for the diffusion model and formulate the intervention as an optimal transport problem. As a result, the generated image has exactly the target color histogram, which can be deterministically converted back to a soft prompt for reconstructing the text. This allows the generated images to encode arbitrary forms of text (e.g., natural text, code, URLs, etc.) while keeping the visual content as natural as possible. Our method offers significant potential for applications in histogram-constrained generation, such as steganography and conditional generation in latent space with semantic meanings.
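As a minimal illustrative sketch (not the paper's method), the step of forcing an image to carry an exact target color histogram can be viewed through the closed-form 1-D optimal transport map: sorting pairs the i-th smallest source value with the i-th smallest target value. The per-channel matching below, with placeholder shapes and random data, is only an assumption-laden toy version of such a histogram constraint.

```python
import numpy as np

def match_histogram_1d(source, target):
    """Transport source values onto target values so the output has
    exactly the target's empirical histogram (1-D OT reduces to sorting)."""
    src = np.asarray(source, dtype=np.float64).ravel()
    tgt = np.asarray(target, dtype=np.float64).ravel()
    assert src.size == tgt.size, "toy example assumes equal pixel counts"
    # The optimal monotone coupling maps the i-th smallest source value
    # to the i-th smallest target value.
    order = np.argsort(src)
    matched = np.empty_like(src)
    matched[order] = np.sort(tgt)
    return matched.reshape(np.shape(source))

# Usage (hypothetical data): match each RGB channel of a generated image
# to a target distribution so the result has the target histogram exactly.
rng = np.random.default_rng(0)
generated = rng.random((64, 64, 3))
target = rng.beta(2.0, 5.0, size=(64, 64, 3))
matched = np.stack(
    [match_histogram_1d(generated[..., c], target[..., c]) for c in range(3)],
    axis=-1,
)
```

In the paper's setting the constraint is imposed on the diffusion model's generation process rather than applied post hoc as above; the sketch only illustrates why an exact histogram match is attainable via optimal transport.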
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10015