RLSpoofer: A Sample-Efficient Black-Box Spoofing Attack for Stress-Testing LLM Watermarks

Hanbo Huang; Xuan Gong; Yiran Zhang; Hao Zheng; Wenbin Dai; Jieren Kuang; Shiyu Liang


Published: 03 Jun 2026, Last Modified: 12 Jun 2026

Venue: AI4GOOD Workshop 2026 Spotlight

License: CC BY 4.0

Keywords: LLM Watermarks, spoofing attack

Abstract: Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often require large datasets and white-box access to the watermarking algorithm's internals, which limits their practical applicability. In this paper, we study watermark resilience against spoofing from a distributional perspective. We first establish a *local capacity bottleneck*, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates subject to semantic-fidelity constraints. Motivated by this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and no access to the watermarking internals or detectors. Despite this weak supervision, it enables a 4B model to achieve a 62.0% spoof success rate with small semantic shift on PF-marked texts, far exceeding the 6% achieved by baseline methods trained on up to 10,000 samples. Our findings expose weaknesses in the spoofing resistance of current LLM watermarking paradigms, provide a sample-efficient evaluation framework, and underscore the urgent need for more robust schemes.
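
To give some intuition for why a KL budget limits how much probability mass can be reallocated, the sketch below uses Pinsker's inequality, a standard result: if KL(q || p) <= epsilon, then for any set of tokens S, |q(S) - p(S)| <= sqrt(epsilon / 2). This is not the paper's local capacity bottleneck and it ignores the semantic-fidelity constraints; the toy distributions and the `favored` token subset are assumptions made purely for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) in nats for two discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pinsker_mass_bound(epsilon):
    """Pinsker upper bound on |q(S) - p(S)| for any set S when KL(q || p) <= epsilon."""
    return math.sqrt(epsilon / 2.0)

def mass_shift(q, p, subset):
    """Probability mass that q adds to `subset` relative to p."""
    return sum(q[i] for i in subset) - sum(p[i] for i in subset)

# Toy next-token distributions (illustrative assumptions, not data from the paper).
p = [0.25, 0.25, 0.25, 0.25]   # reference policy
q = [0.35, 0.30, 0.20, 0.15]   # locally updated policy
favored = [0, 1]               # hypothetical subset of tokens the update favors

epsilon = kl_divergence(q, p)
shift = mass_shift(q, p, favored)
bound = pinsker_mass_bound(epsilon)

print(f"KL(q || p) = {epsilon:.4f} nats")
print(f"mass moved onto favored tokens = {shift:.4f}")
print(f"Pinsker bound on any such shift = {bound:.4f}")
```

In this toy case the update moves 0.15 of probability mass onto the favored tokens, which stays within the Pinsker bound for the KL divergence it spends. A tighter, task-specific characterization of this trade-off is what the abstract attributes to the paper's analysis.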


Submission Number: 227
