Toward Resilient Watermark Detection: Stability-Aware Statistical Features for Machine-Generated Text
Keywords: Watermarking, paraphrasing, robustness
Abstract: The widespread adoption of large language models (LLMs) has intensified the demand for principled methods to distinguish human- from machine-generated text. Watermarking provides a promising avenue, yet existing detectors deteriorate sharply under repeated paraphrasing and on shorter texts. We introduce the Pattern Stability Score (PSS), a detection framework that leverages local statistical features and stability dynamics across paraphrased variants. Specifically, the proposed method combines global and local z-score features with higher-order statistics of run-length patterns, enriched by autocorrelation signals and stability scores computed over paraphrase depth. Numerical evaluations are performed on PG-19, a large-scale long-form benchmark, while systematically stress-testing robustness under up to eight rounds of paraphrasing with Mistral-7B. Compared to prior z-score thresholding baselines, our approach improves detection AUC (area under the receiver operating characteristic curve) by 10–15 percentage points across token lengths. It also achieves a strong precision–recall balance and AUC greater than 0.95 at full length, demonstrating resilience where prior detectors collapse. Finally, sensitivity analyses over window size, stride, and token length validate our design choices. Overall, these empirical results establish PSS as a practical and extensible framework for watermark detection, highlighting stability-based features as a promising direction for safeguarding LLM outputs against adversarial paraphrasing.
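The abstract names four feature families: global/local z-scores, run-length statistics, autocorrelation, and a stability score over paraphrase depth. The paper's exact definitions are not given here, so the following is only an illustrative sketch under common assumptions: tokens are reduced to a binary "green-list" indicator sequence (as in standard LLM watermarking with green-list fraction `GAMMA`), and the stability score is a hypothetical placeholder (inverse dispersion of the global z-score across paraphrase depths), not the authors' formula.

```python
import math
from statistics import mean, pstdev

GAMMA = 0.5  # assumed green-list fraction (illustrative, not from the paper)

def z_score(greens):
    """Standard watermark z-score: deviation of the green count from GAMMA*T."""
    T = len(greens)
    return (sum(greens) - GAMMA * T) / math.sqrt(T * GAMMA * (1 - GAMMA))

def local_z_scores(greens, window=32, stride=16):
    """Z-scores over sliding windows; window/stride mirror the tuned knobs."""
    return [z_score(greens[i:i + window])
            for i in range(0, len(greens) - window + 1, stride)]

def run_length_stats(greens):
    """Mean and std of run lengths of identical consecutive labels."""
    runs, count = [], 1
    for a, b in zip(greens, greens[1:]):
        if a == b:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return mean(runs), pstdev(runs)

def autocorr_lag1(greens):
    """Lag-1 autocorrelation of the green indicator sequence."""
    m = mean(greens)
    num = sum((a - m) * (b - m) for a, b in zip(greens, greens[1:]))
    den = sum((a - m) ** 2 for a in greens)
    return num / den if den else 0.0

def stability_score(variants):
    """Toy stability: higher when the global z-score survives paraphrasing.

    `variants` is a list of green sequences, one per paraphrase depth.
    """
    return 1.0 / (1.0 + pstdev([z_score(v) for v in variants]))
```

A genuinely watermarked text keeps its windowed z-scores high across paraphrase depths, so the stability term stays near 1, while a spuriously high z-score on unwatermarked text tends to wash out under paraphrasing.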
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13768