SIRA: Exposing Vulnerabilities in Text Watermarking with Self-Information Rewrite Attacks

27 Sept 2024 (modified: 23 Jan 2025) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0
Keywords: LLM watermark, robustness, safety ai, paraphrasing attack
TL;DR: A lightweight approach to removing LLM watermarks by finding green tokens with self-information
Abstract: Text watermarking is designed to embed hidden, imperceptible markers within content generated by large language models (LLMs), with the goal of tracing and verifying the content’s origin to prevent misuse. The robustness of watermarking algorithms has become a key factor in evaluating their effectiveness, but remains an open problem. In this work, we introduce a novel watermark removal attack, the Self-Information Rewrite Attack (SIRA), which poses a new challenge to the robustness of existing watermarking techniques. Since embedding watermarks requires both concealment and semantic coherence, current methods prefer to embed them in high-entropy tokens. However, this reveals an inherent vulnerability, allowing us to exploit this feature to identify potential green tokens. Our approach leverages the self-information of each token to filter potential pattern tokens that embed watermarks and performs the attack through masking and rewriting in a black-box setting. We demonstrate the effectiveness of our attack by implementing it against seven recent watermarking algorithms. The experimental results show that our lightweight algorithm achieves a state-of-the-art attack success rate while maintaining shorter execution times and lower computational resource consumption compared to existing methods. This attack points to an important vulnerability of existing watermarking techniques and paves the way for future watermarking improvements.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9394
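
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy unigram model as a stand-in for a real language model's token probabilities, and the names `self_information`, `mask_top_k`, the `floor` probability for unseen tokens, and the `[MASK]` placeholder are all hypothetical. The self-information of a token x with probability p(x) is -log2 p(x); the sketch masks the k tokens with the highest score, which are the candidates most likely to carry a watermark under the abstract's reasoning.

```python
import math
from collections import Counter


def self_information(tokens, probs, floor=1e-6):
    """Return -log2 p(token) for each token.

    `probs` maps token -> probability. `floor` is an assumed fallback
    probability for tokens missing from `probs` (hypothetical choice).
    """
    return [-math.log2(probs.get(t, floor)) for t in tokens]


def mask_top_k(tokens, probs, k, mask="[MASK]"):
    """Replace the k tokens with the highest self-information by `mask`.

    Ties are broken by position (earlier tokens first), because Python's
    sort is stable.
    """
    scores = self_information(tokens, probs)
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    to_mask = set(ranked[:k])
    return [mask if i in to_mask else t for i, t in enumerate(tokens)]


# Toy unigram model estimated from a tiny reference corpus (assumption:
# a real attack would use probabilities from a language model instead).
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())
probs = {tok: c / total for tok, c in counts.items()}

text = "the cat sat on the rug".split()
print(" ".join(mask_top_k(text, probs, k=2)))
```

In the attack as the abstract describes it, the masked text would then be handed to a rewriting model that fills in the masked positions; that step is omitted here because it depends on the choice of model.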