Fragile by Design: Formalizing Watermarking Tradeoffs via Paraphrasing

Published: 05 Jun 2025, Last Modified: 15 Jul 2025
Venue: ICML 2025 Workshop TAIG Poster
License: CC BY 4.0
Keywords: AI Governance, LLM Verification, Watermarking, Paraphrasing, Robustness, Imperceptibility
TL;DR: Watermarks in language models can't be both robust and invisible under paraphrasing; this paper proves it and introduces an ε-δ framework to quantify the trade-off.
Abstract: Verification is a cornerstone of technical AI governance, enabling auditability, attribution, and accountability in AI-generated content. As generative models proliferate, watermarking has emerged as a leading strategy for tracing provenance. However, advanced paraphrasing methods pose a serious threat: they can erase watermarks without altering meaning. We model watermarking under paraphrasing as an adversarial game and prove a no-go theorem: under idealized conditions, no watermark can be both robust and imperceptible. Even with imperfect paraphrasers, robustness remains fragile and is easily broken. To navigate this tension, we propose the $\varepsilon$-$\delta$ framework, which quantifies the trade-off between robustness ($\varepsilon$) and semantic distortion ($\delta$). Our findings highlight a key asymmetry: removing a watermark is often easier than embedding one that survives. The $\varepsilon$-$\delta$ framework offers a principled foundation for evaluating watermarking in adversarial, safety-critical settings.
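To make the trade-off concrete, here is a hedged sketch of one way such an $\varepsilon$-$\delta$ criterion could be instantiated; the symbols $W$, $D$, $P$, and $d_{\mathrm{sem}}$ are illustrative assumptions, and the paper's formal definitions appear in the full text and may differ. Let $W$ denote the watermark embedder, $D$ the detector, and $d_{\mathrm{sem}}$ a semantic distance. Call a scheme $\varepsilon$-robust under $\delta$-bounded paraphrasing if, for every paraphraser $P$ whose output stays within semantic distance $\delta$ of its input,

$$\Pr\big[D(P(W(x))) = 1\big] \;\geq\; 1 - \varepsilon \quad \text{whenever } d_{\mathrm{sem}}\big(P(W(x)),\, W(x)\big) \leq \delta.$$

Read in these terms, the abstract's no-go theorem says that against an idealized semantics-preserving paraphraser, no scheme can achieve small $\varepsilon$ while the watermark remains imperceptible.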
Submission Number: 35