Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions

Submitted to ICLR 2025 on 26 Sept 2024 (last modified: 05 Feb 2025). License: CC BY 4.0
Keywords: security, machine learning, adversarial perturbations, large language models
TL;DR: We introduce a new type of indirect, cross-modal injection attack against VLMs that influences how the model interprets an image and steers its outputs to express an adversary-chosen style, sentiment, or point of view.
Abstract:

We introduce a new type of indirect, cross-modal injection attack against language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer its outputs to express an adversary-chosen style, sentiment, or point of view. We create meta-instructions by generating images that act as soft prompts. In contrast to jailbreaking attacks and adversarial examples, outputs produced in response to these images are plausible and based on the visual content of the image, yet also satisfy the adversary's (meta-)objective. We evaluate the efficacy of meta-instructions for multiple models and adversarial meta-objectives, and demonstrate how they "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. We describe how meta-instruction attacks could cause harm by enabling the creation of self-interpreting content that carries spam, misinformation, and spin.
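The attack sketched in the abstract is, at its core, a constrained image optimization: pixels are perturbed so a loss measuring how well the model's output satisfies the meta-objective decreases, while the perturbation stays small enough that the image's visible content is preserved. The following is a minimal, hedged sketch of that loop in the style of projected gradient descent. The paper's actual loss backpropagates through a visual language model; here a toy quadratic surrogate (`meta_objective_loss`, a name invented for illustration) stands in for it so the example is self-contained and runnable.

```python
def meta_objective_loss(image, target):
    """Toy surrogate for the VLM loss: squared distance to a pixel pattern
    standing in for 'outputs that satisfy the adversary's meta-objective'.
    The real attack would instead score the model's generated text."""
    return 0.5 * sum((x - t) ** 2 for x, t in zip(image, target))

def clip(x, lo, hi):
    """Clamp a scalar into [lo, hi]."""
    return max(lo, min(hi, x))

def craft_meta_instruction(image, target, eps=0.06, step=0.01, iters=200):
    """PGD-style loop: step pixels along the negative gradient of the
    surrogate loss, then project back into the L-infinity eps-ball around
    the original image (keeping it visually similar) and into the valid
    pixel range [0, 1]. All hyperparameter values are illustrative."""
    adv = list(image)
    for _ in range(iters):
        # Gradient of the toy quadratic loss is (adv - target); the real
        # attack obtains this gradient by backpropagating through the VLM.
        adv = [x - step * (x - t) for x, t in zip(adv, target)]
        # Project: stay within eps of the original image, and in [0, 1].
        adv = [clip(clip(x, o - eps, o + eps), 0.0, 1.0)
               for x, o in zip(adv, image)]
    return adv

# Tiny demo on a 3-"pixel" image: the perturbed image moves toward the
# meta-objective target but never leaves the eps-ball around the original.
original = [0.5, 0.4, 0.6]
target = [0.9, 0.1, 0.6]
adversarial = craft_meta_instruction(original, target)
```

The same structure underlies soft-prompt-style image attacks generally: only the loss changes, from this toy quadratic to a differentiable score over the model's outputs.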

Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7440
