Keywords: LLM watermarking, robustness, privacy
TL;DR: We propose a new watermark-evasion method that achieves over 99% evasion rates while preserving semantic fidelity, exposing vulnerabilities in LLM watermarking.
Abstract: Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the *Bias-Inversion Rewriting Attack* (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
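The core mechanism described in the abstract, suppressing the logits of tokens suspected to carry the watermark bias during rewriting, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the suspicion scores, penalty value, and function names are all hypothetical assumptions, since the abstract does not specify how suspected tokens are identified.

```python
# Hypothetical sketch of bias-inversion logit suppression.
# Assumption: the attacker has, from the input text, some per-token
# "suspicion" estimate of how watermark-favored each token is; the
# penalty strength (4.0) is an illustrative value, not from the paper.

def suppress_suspected_tokens(logits, suspicion_scores, penalty=4.0):
    """Subtract a penalty from the logits of suspected watermarked
    tokens, inverting the watermark's positive bias before sampling."""
    return {
        tok: logit - penalty * suspicion_scores.get(tok, 0.0)
        for tok, logit in logits.items()
    }

# Toy rewriting step: token "b" is suspected of being watermark-favored.
logits = {"a": 1.0, "b": 1.5, "c": 0.5}
suspicion = {"b": 1.0}  # hypothetical estimate derived from the input text
adjusted = suppress_suspected_tokens(logits, suspicion)
```

In a real attack the adjusted logits would be fed back into the rewriting model's sampling loop at every decoding step, so the rewrite systematically avoids the tokens a detector would count as watermark evidence.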
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4491