Optimizing Adaptive Attacks against Content Watermarks for Language Models

Published: 06 Mar 2025, Last Modified: 16 Apr 2025 · WMARK@ICLR 2025 · CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: Generative AI, LLM, Watermark, Adversarial Attack
TL;DR: We propose an adaptive evasive attack against language model watermarks
Abstract: Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in generated outputs, enabling detection using a secret \emph{watermarking key}. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but their robustness is tested only against \emph{non-adaptive} attackers, who lack knowledge of the provider's watermarking method and can therefore find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and use preference-based optimization to tune \emph{adaptive} attacks against the specific watermarking method. Our evaluation shows that: (i) adaptive attacks evade detection against all surveyed watermarking methods; (ii) even in a non-adaptive setting, attacks optimized against known watermarks remain effective when tested on unseen watermarks; and (iii) optimization-based attacks are scalable, requiring limited computational resources of less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attacks.
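The abstract's core idea of treating robustness as an objective and tuning an attack with preference-based optimization can be illustrated with a minimal sketch. Everything below is an assumption for illustration: `detector_score` and `quality_score` are toy stand-ins for a real watermark detector and quality judge, and the objective's quality-minus-detectability form is a plausible instance of the paper's formulation, not the authors' exact formula.

```python
def detector_score(text: str) -> float:
    # Toy stand-in for a watermark detector: higher = more likely
    # watermarked. Here we just count an "@" marker token per word.
    return text.count("@") / max(len(text.split()), 1)

def quality_score(text: str) -> float:
    # Toy stand-in for a text-quality judge (real attacks would use a
    # perplexity or similarity model); longer paraphrases score higher.
    return min(len(text.split()) / 10.0, 1.0)

def attack_objective(text: str, alpha: float = 1.0) -> float:
    """Objective to MAXIMIZE: preserve quality while evading detection."""
    return quality_score(text) - alpha * detector_score(text)

def build_preference_pair(prompt: str, candidates: list[str], alpha: float = 1.0):
    """Rank paraphrase candidates by the attack objective and return a
    (prompt, preferred, rejected) triple, the kind of preference pair a
    DPO-style fine-tuning step could consume to tune the paraphraser."""
    ranked = sorted(candidates, key=lambda c: attack_objective(c, alpha),
                    reverse=True)
    return (prompt, ranked[0], ranked[-1])
```

Iterating this loop (paraphrase, score against the detector, fine-tune on the resulting preference pairs) is one way an attacker with knowledge of the watermarking method could adapt to it.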
Presenter: ~Abdulrahman_Diaa1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 22