Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images

Ali Naseh; Katherine Thai; Mohit Iyyer; Amir Houmansadr

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 | COLM | CC BY 4.0

Research Area: Data, Safety

Keywords: Trustworthy ML, Text-to-Image models

TL;DR: This paper presents a novel method using multi-modal AI models to efficiently replicate prompts from popular text-to-image APIs, highlighting potential new threats to AI-generated and natural image markets.

Abstract: With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 to $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Additionally, this approach holds promise as a tool for data augmentation, potentially enhancing machine learning models by providing varied and cost-effective training data. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Flagged For Ethics Review: true

Ethics Comments: The method in this paper can be used to plagiarize copyrighted artworks. The authors should discuss the potential impact of using this method. Generating images that look like real stock images may have ethical implications. While the authors present this as an attack strategy, in the wrong hands it can be misused. Of course, this is a danger posed by many modern deep learning approaches.

Submission Number: 1280
