Keywords: LLM watermarks, RL-based removal attack
TL;DR: We propose an RL-based attack that effectively removes watermarks from text using only 100 training samples and zero access to watermark detectors.
Abstract: Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating the security of these schemes. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and show theoretically that optimizing the attack context and model parameters can substantially reduce the approximated radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only *limited* watermarked examples and *zero* access to the detector. Despite this weak supervision, it enables a 3B model to achieve a 98.5% removal success rate *with minimal semantic shift* on 1,500-token Unigram-watermarked texts after training on only *100* short samples. This dramatically exceeds the 6.75% achieved by GPT-4o and generalizes across five model sizes and ten watermarking schemes. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.
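For intuition, one plausible formalization of the adaptive robustness radius sketched above is given below. This is an illustrative sketch under assumed notation ($p_w$, $q$, $D$, $\alpha$ are not taken from the paper), not the paper's exact definition.

```latex
% Illustrative sketch (assumed notation, not the paper's exact definition):
%   p_w    : distribution of watermarked outputs for a given prompt
%   q      : an attacker's paraphrase distribution over rewrites y
%   D(y)   : watermark detector, D(y) = 1 iff y is flagged as watermarked
%   \alpha : detection rate the attacker is willing to tolerate
% The adaptive robustness radius can be read as the radius of the smallest
% KL-divergence ball around p_w that contains a paraphraser evading detection:
\[
  r(p_w, D) \;=\; \inf_{q} \Big\{ \mathrm{KL}\big(q \,\|\, p_w\big)
      \;:\; \Pr_{y \sim q}\!\big[D(y) = 1\big] \le \alpha \Big\}.
\]
% Under this reading, an adaptive adversary that tunes its attack context and
% model parameters searches over a richer family of q, which can only shrink
% the attainable radius relative to a fixed, non-adaptive paraphraser.
```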
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4777