Watermark Smoothing Attacks against Language Models

Hongyan Chang; Hamed Hassani; Reza Shokri

Published: 06 Mar 2025, Last Modified: 18 Apr 2025

Venue: WMARK@ICLR2025

License: CC BY 4.0

Track: long paper (up to 9 pages)

Keywords: Watermark, Language models

Abstract: Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model’s confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from 1.3B to 30B parameters and on 10 different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
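
The idea of selective smoothing can be illustrated with a toy sketch. The abstract does not spell out the algorithm, so everything here is an assumption made for illustration: the function name `smooth_step`, the use of the top-token probability as the confidence score, the `threshold` value, and the existence of a watermark-free reference distribution `ref_probs`. It is not the authors' implementation. The sketch assumes that positions where the watermarked model is uncertain are the ones where a watermark shifts token choice the most, so only those positions are replaced.

```python
def smooth_step(wm_probs, ref_probs, threshold=0.5):
    """Choose the next-token distribution for one decoding position.

    wm_probs:  next-token probabilities from the watermarked model.
    ref_probs: next-token probabilities from a hypothetical
               watermark-free reference model (an assumption).
    threshold: hypothetical confidence cutoff (an assumption).
    """
    if len(wm_probs) != len(ref_probs):
        raise ValueError("distributions must share the same vocabulary")

    # Confidence is approximated by the top-token probability.
    confidence = max(wm_probs)

    # Confident position: keep the watermarked distribution, which
    # preserves the text quality of the stronger model.
    if confidence >= threshold:
        return list(wm_probs)

    # Uncertain position: fall back to the reference distribution,
    # which carries no watermark signal.
    return list(ref_probs)
```

For example, `smooth_step([0.9, 0.05, 0.05], [0.3, 0.3, 0.4])` keeps the watermarked distribution, while `smooth_step([0.4, 0.3, 0.3], [0.3, 0.3, 0.4])` returns the reference distribution. A hard switch is used only for simplicity; a weighted mixture of the two distributions would be another plausible reading of "smoothing".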

Presenter: ~Hongyan_Chang1

Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.

Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Submission Number: 43
