Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

TMLR Paper2865 Authors

13 Jun 2024 (modified: 17 Sept 2024) · Rejected by TMLR · License: CC BY 4.0
Abstract: Teaching text-to-image models to be creative involves a style ambiguity loss, which conventionally requires a pretrained style classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that requires neither a trained classifier nor a labeled dataset. We then train a diffusion model to maximize style ambiguity, imbuing it with creativity, and find that our new methods improve upon the traditional classifier-based approach on automated proxies for human judgment while still maintaining creativity and novelty.
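The style ambiguity objective described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it assumes image embeddings from a multimodal foundation model (e.g. a CLIP-style encoder) and K cluster centroids obtained by clustering unlabeled style embeddings, which together stand in for the pretrained style classifier. Ambiguity is scored as the cross-entropy between the softmax over centroid similarities and a uniform distribution over the K clusters; the loss is minimized (equal to log K) when the image is equally close to every style cluster.

```python
import torch
import torch.nn.functional as F


def style_ambiguity_loss(image_embeds: torch.Tensor,
                         style_centroids: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch of a classifier-free style ambiguity loss.

    image_embeds:    (B, D) foundation-model embeddings of generated images
    style_centroids: (K, D) cluster centers from unlabeled style embeddings
    """
    img = F.normalize(image_embeds, dim=-1)
    cen = F.normalize(style_centroids, dim=-1)
    # Cosine similarity to each cluster acts as a zero-shot "style logit".
    logits = img @ cen.T / temperature                  # (B, K)
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross-entropy against a uniform target: lowest when the image is
    # equally similar to all K style clusters (maximal style ambiguity).
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    return -(uniform * log_probs).sum(dim=-1).mean()
```

In a diffusion training loop this term would be minimized alongside (or in place of) the usual reward, pushing generations away from any single recognizable style cluster, in the spirit of Creative Adversarial Networks.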
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=hD0enTLMMI
Changes Since Last Submission:
- Added more details about training, such as hardware, training times, and environmental impact
- Provided exact names of checkpoints used for foundation and scoring models
- Added diagrams showing the Creative Adversarial Network architecture
- Retrained models, used new images from the retrained models, and reported new results
- Repeated experiments for lower-dimensional images
Assigned Action Editor: ~Dmitry_Kangin1
Submission Number: 2865