Discovering Spoofing Attempts on Language Model Watermarks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, in which an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Although recent work has demonstrated that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods for discovering spoofing attempts. In this work, we propose, for the first time, a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that, regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
Lay Summary: Detecting unauthorized uses of AI-generated text is crucial, especially given the growing risk of misuse such as academic cheating or automated disinformation campaigns. A promising way to identify such texts involves adding hidden signals, or watermarks, to the generated text in order to attribute its origin to specific language models, thus holding model providers accountable for misuse. However, recent methods allow attackers to forge these watermarks, falsely attributing texts to certain models and threatening the credibility of this identification method. In our research, we discovered that current forgery methods, despite their apparent effectiveness, consistently leave subtle but detectable traces in the text. By studying these traces, we developed statistical tests that can reliably differentiate between genuinely watermarked texts and forged ones. Our experiments showed that our approach works effectively across various watermarking schemes and forgery methods. This research provides practical tools to maintain the trustworthiness of AI-generated text, empowering model providers to detect and eliminate forgery attempts, thus making it significantly more difficult for attackers to discredit watermarks through forgery.
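For background on the kind of per-token statistics that the abstract and lay summary refer to, below is a minimal, purely illustrative Python sketch of a green-list z-test as used by common LLM watermark detectors. The hashing scheme, key, and constants are hypothetical placeholders chosen for the example, and this is not the spoofing-detection test proposed in the paper, which instead tests for artifacts left behind by learning-based spoofers.

```python
# Illustrative sketch only: a simplified green-list watermark detection z-test.
# The seeding/hashing scheme, key, and GAMMA below are hypothetical placeholders.
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary placed on the green list


def in_green_list(prev_token: int, token: int, key: str = "secret") -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the
    previous token and a watermark key (hypothetical hashing scheme)."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return (digest[0] / 255.0) < GAMMA


def green_list_z_score(token_ids: list[int]) -> float:
    """z-statistic under the null hypothesis 'text is unwatermarked',
    i.e. each token lands in the green list with probability GAMMA."""
    t = len(token_ids) - 1
    if t <= 0:
        return 0.0
    hits = sum(
        in_green_list(prev, cur) for prev, cur in zip(token_ids, token_ids[1:])
    )
    return (hits - GAMMA * t) / math.sqrt(GAMMA * (1 - GAMMA) * t)


if __name__ == "__main__":
    sample = [17, 42, 7, 99, 3, 58, 21, 5, 88, 13, 44, 2]
    print(f"z = {green_list_z_score(sample):.2f}")  # large z => watermark detected
```

A spoofer who has only approximately learned such a green-list rule can still push the z-score above the detection threshold, which is why the paper's tests instead look for statistical artifacts of that imperfect approximation; the repository linked below contains the actual implementation.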
Link To Code: https://github.com/eth-sri/watermark-spoofing-detection
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: LLM, watermarks, watermark spoofing
Submission Number: 7297