Fragile by Design: Formalizing Watermarking Tradeoffs via Paraphrasing

Published: 05 Jun 2025, Last Modified: 15 Jul 2025
Venue: ICML 2025 Workshop TAIG Poster
License: CC BY 4.0
Keywords: AI Governance, LLM Verification, Watermarking, Paraphrasing, Robustness, Imperceptibility
TL;DR: Watermarks in language models can't be both robust and invisible under paraphrasing; this paper proves it and introduces an ε-δ framework to quantify the trade-off.
Abstract: Verification is a cornerstone of technical AI governance, enabling auditability, attribution, and accountability in AI-generated content. As generative models proliferate, watermarking has emerged as a leading strategy for tracing provenance. However, advanced paraphrasing methods pose a serious threat: they can erase watermarks without altering meaning. We model watermarking under paraphrasing as an adversarial game and prove a no-go theorem: under idealized conditions, no watermark can be both robust and imperceptible. Even with imperfect paraphrasers, robustness remains fragile and is easily broken. To navigate this tension, we propose the $\varepsilon$-$\delta$ framework, which quantifies the trade-off between robustness ($\varepsilon$) and semantic distortion ($\delta$). Our findings highlight a key asymmetry: removing a watermark is often easier than embedding one that survives. The $\varepsilon$-$\delta$ framework offers a principled foundation for evaluating watermarking in adversarial, safety-critical settings.
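To make the trade-off concrete, here is a hedged sketch of one way such an $\varepsilon$-$\delta$ criterion could be instantiated; the symbols $W$, $D$, $P$, and $d_{\mathrm{sem}}$ are illustrative assumptions, and the paper's formal definitions appear in the full text and may differ. Let $W$ denote the watermark embedder, $D$ the detector, and $d_{\mathrm{sem}}$ a semantic distance. Call a scheme $\varepsilon$-robust under $\delta$-bounded paraphrasing if, for every paraphraser $P$ whose output stays within semantic distance $\delta$ of its input,

$$\Pr\big[D(P(W(x))) = 1\big] \;\geq\; 1 - \varepsilon \quad \text{whenever } d_{\mathrm{sem}}\big(P(W(x)),\, W(x)\big) \leq \delta.$$

Read in these terms, the abstract's no-go theorem says that against an idealized semantics-preserving paraphraser, no scheme can achieve small $\varepsilon$ while the watermark remains imperceptible.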
Submission Number: 35