Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: An approach to attacking LLM watermarks by identifying green tokens via self-information.
Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. Experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only $0.88 per million tokens. Our approach requires no access to the watermarking algorithm or the watermarked LLM and transfers seamlessly to any LLM as the attack model, including mobile-scale models. Our findings highlight the urgent need for more robust watermarking.
Lay Summary: The rapid advancement of Large Language Models (LLMs) has brought concerns about their potential misuse, such as spreading misinformation and threatening academic integrity. To address this, text watermarking has emerged as a promising solution, subtly embedding imperceptible patterns into LLM-generated text to verify its origin. However, the effectiveness of these watermarks depends on their robustness against attacks that try to remove them. Existing attack methods are often inefficient, untargeted, resource-intensive, and not easily transferable across different LLMs. Our research introduces the Self-Information Rewrite Attack (SIRA), a novel and efficient paraphrasing attack that reveals a fundamental vulnerability in current text watermarking algorithms. We observe that watermarking techniques embed patterns in "high-entropy" tokens, i.e., tokens with high self-information due to their unpredictability and low probability. SIRA exploits this by calculating the self-information of each token to identify and mask these potential watermark-carrying tokens. We then use an LLM to perform a targeted "fill-in-the-blank" task, rewriting the masked text while preserving its semantic integrity. SIRA represents a significant step forward in understanding and evaluating the robustness of LLM watermarking. Our experiments show that SIRA achieves nearly 100% attack success rates across seven recent watermarking methods, at a very low cost of $0.88 per million tokens. The attack requires no prior knowledge of the watermark algorithm or the LLM used, and it is highly transferable, working even with smaller, mobile-scale models. By exposing this widespread vulnerability, our work highlights the urgent need for more robust and adaptive watermarking approaches to ensure transparency and integrity in AI-generated content.
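To make the mask-then-rewrite idea concrete, here is a minimal sketch (not the authors' implementation; see the linked repository for that): each token's self-information, -log p(token | prefix), is scored with a small causal LM, the highest-information tokens are masked, and the masked text is handed to any instruction-tuned LLM as a fill-in-the-blank rewriting prompt. The scoring model (gpt2), the 75th-percentile threshold, the [MASK] placeholder, and the prompt wording are illustrative assumptions, not values from the paper.

```python
# Sketch of a self-information rewrite attack step, assuming a HuggingFace causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # assumed scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def self_information(text: str) -> list[tuple[str, float]]:
    """Return (token, -log p(token | prefix)) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # prediction for each next token
    targets = ids[0, 1:]
    info = -log_probs[torch.arange(targets.numel()), targets]
    return [(tokenizer.decode(t.item()), i.item()) for t, i in zip(targets, info)]

def mask_high_info(text: str, quantile: float = 0.75, mask: str = "[MASK]") -> str:
    """Mask tokens whose self-information exceeds the chosen quantile (assumed threshold)."""
    scored = self_information(text)
    threshold = torch.tensor([s for _, s in scored]).quantile(quantile).item()
    first = tokenizer(text, return_tensors="pt").input_ids[0, :1]
    pieces = [tokenizer.decode(first)]
    pieces += [mask if s > threshold else tok for tok, s in scored]
    return "".join(pieces)

masked = mask_high_info("The watermarked passage to be rewritten goes here.")
# The masked text would then go to any paraphrasing LLM with a fill-in-the-blank prompt, e.g.:
prompt = f"Rewrite the following text, replacing every [MASK] so the result reads naturally:\n{masked}"
```

Because likely (low self-information) tokens are left untouched, the rewriting LLM only has to regenerate the unpredictable positions where watermark signals concentrate, which is what keeps the attack cheap and model-agnostic.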
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Allencheng97/Self-information-Rewrite-Attack/tree/main
Primary Area: Deep Learning->Large Language Models
Keywords: LLM watermark, robustness, privacy and societal considerations
Submission Number: 3589