Attacking LLM Watermarks by Exploiting Their Strengths

Published: 04 Mar 2024, Last Modified: 14 Apr 2024
Venue: SeT LLM @ ICLR 2024
License: CC BY 4.0
Keywords: watermarking, large language models, security, privacy-preserving machine learning
TL;DR: We reveal and evaluate new attack vectors that exploit the common properties of LLM watermarks.
Abstract: Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, existing watermarking schemes remain surprisingly susceptible to attack. In particular, we show that desirable properties shared by existing LLM watermarking systems, such as quality preservation, robustness, and public detection APIs, can in turn make these systems vulnerable to various attacks. We rigorously study potential attacks in terms of common watermark design choices, and propose best practices and defenses for mitigation, establishing a set of practical guidelines for the embedding and detection of LLM watermarks.
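The following is an illustrative sketch, not taken from the paper, of why a public detection API can itself become an attack surface: an adversary can treat the detector as an oracle, repeatedly perturbing flagged text until detection fails. The detect and paraphrase callables are hypothetical placeholders for a public detection endpoint and any paraphrasing model.

    # Illustrative sketch (assumption, not from the paper): abusing a public
    # watermark detection API as an oracle. `detect` and `paraphrase` are
    # hypothetical stand-ins for a detection endpoint and a paraphrasing model.
    def scrub_watermark(text, detect, paraphrase, max_queries=20):
        """Repeatedly paraphrase `text` until the public detector stops flagging it."""
        candidate = text
        for _ in range(max_queries):
            if not detect(candidate):          # oracle reports "no watermark detected"
                return candidate
            candidate = paraphrase(candidate)  # perturb the text and query again
        return candidate  # may still carry the watermark if the query budget runs out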
Submission Number: 55