Optimizing Adaptive Attacks against Content Watermarks for Language Models

Published: 06 Mar 2025, Last Modified: 16 Apr 2025 · WMARK@ICLR 2025 · CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: Generative AI, LLM, Watermark, Adversarial Attack
TL;DR: We propose an adaptive evasive attack against language model watermarks
Abstract: Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in generated outputs, enabling detection using a secret \emph{watermarking key}. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but their robustness is tested only against \emph{non-adaptive} attackers, who lack knowledge of the provider's watermarking method and can therefore find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and use preference-based optimization to tune \emph{adaptive} attacks against the specific watermarking method. Our evaluation shows that: (i) adaptive attacks evade detection against all surveyed watermarking methods; (ii) even in a non-adaptive setting, attacks optimized against known watermarks remain effective when tested on unseen watermarks; and (iii) optimization-based attacks are scalable, requiring limited computational resources of less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attacks.
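The abstract's core idea of treating robustness as an objective and tuning an attack with preference-based optimization can be illustrated with a minimal sketch. Everything below is an assumption for illustration: `detector_score` and `quality_score` are toy stand-ins for a real watermark detector and quality judge, and the objective's quality-minus-detectability form is a plausible instance of the paper's formulation, not the authors' exact formula.

```python
def detector_score(text: str) -> float:
    # Toy stand-in for a watermark detector: higher = more likely
    # watermarked. Here we just count an "@" marker token per word.
    return text.count("@") / max(len(text.split()), 1)

def quality_score(text: str) -> float:
    # Toy stand-in for a text-quality judge (real attacks would use a
    # perplexity or similarity model); longer paraphrases score higher.
    return min(len(text.split()) / 10.0, 1.0)

def attack_objective(text: str, alpha: float = 1.0) -> float:
    """Objective to MAXIMIZE: preserve quality while evading detection."""
    return quality_score(text) - alpha * detector_score(text)

def build_preference_pair(prompt: str, candidates: list[str], alpha: float = 1.0):
    """Rank paraphrase candidates by the attack objective and return a
    (prompt, preferred, rejected) triple, the kind of preference pair a
    DPO-style fine-tuning step could consume to tune the paraphraser."""
    ranked = sorted(candidates, key=lambda c: attack_objective(c, alpha),
                    reverse=True)
    return (prompt, ranked[0], ranked[-1])
```

Iterating this loop (paraphrase, score against the detector, fine-tune on the resulting preference pairs) is one way an attacker with knowledge of the watermarking method could adapt to it.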
Presenter: ~Abdulrahman_Diaa1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 22