Keywords: LLM watermark, robustness, AI safety, paraphrasing attack
TL;DR: A lightweight approach to removing LLM watermarks by identifying green tokens via self-information
Abstract: Text watermarking is designed to embed hidden, imperceptible markers within content generated by large language models (LLMs), with the goal of tracing and verifying the content's origin to prevent misuse. The robustness of watermarking algorithms has become a key factor in evaluating their effectiveness, but it remains an open problem. In this work, we introduce a novel watermark removal attack, the Self-Information Rewrite Attack (SIRA), which poses a new challenge to the robustness of existing watermarking techniques. Because embedding watermarks requires both concealment and semantic coherence, current methods prefer to embed them in high-entropy tokens. However, this reveals an inherent vulnerability that we exploit to identify potential green tokens. Our approach leverages the self-information of each token to filter potential pattern tokens that embed watermarks, then performs the attack through masking and rewriting in a black-box setting. We demonstrate the effectiveness of our attack by implementing it against seven recent watermarking algorithms. The experimental results show that our lightweight algorithm achieves a state-of-the-art attack success rate while maintaining shorter execution times and lower computational resource consumption than existing methods. This attack points to an important vulnerability of existing watermarking techniques and paves the way toward future watermarking improvements.
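The core filtering idea described in the abstract (score each token by its self-information, then mask the high-information tokens for rewriting) can be sketched as follows. This is an illustrative toy, not the paper's implementation: it uses a unigram frequency model estimated from the text itself as a stand-in for a real language model's token probabilities, and the `[MASK]` placeholder and threshold value are assumptions for demonstration.

```python
import math
from collections import Counter

def self_information(tokens, probs):
    # Self-information I(t) = -log2 p(t); higher values mean more
    # "surprising" (high-entropy) tokens, where watermarks tend to live.
    return {i: -math.log2(probs.get(tok, 1e-9)) for i, tok in enumerate(tokens)}

def mask_high_info_tokens(tokens, probs, threshold):
    # Replace tokens whose self-information exceeds the threshold with a
    # placeholder, so a rewriter model can fill them in afterwards.
    info = self_information(tokens, probs)
    return [("[MASK]" if info[i] > threshold else tok)
            for i, tok in enumerate(tokens)]

# Toy unigram "language model" estimated from the text itself
# (a real attack would query an actual LM for token probabilities).
text = "the cat sat on the mat near the unusual obelisk".split()
counts = Counter(text)
total = sum(counts.values())
probs = {t: c / total for t, c in counts.items()}

masked = mask_high_info_tokens(text, probs, threshold=3.0)
print(masked)
```

Here the frequent token "the" (low self-information) survives, while rarer tokens are masked as candidate watermark positions for rewriting.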
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9394