TL;DR: We propose methods to optimize adaptive attacks against content watermarks for language models and demonstrate the necessity of testing robustness against adaptive attacks.
Abstract: Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret *watermarking key*. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against *non-adaptive* attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune *adaptive* attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against *any* watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively tuned paraphrasers at <https://github.com/nilslukas/ada-wm-evasion>.
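The abstract's core technical step, tuning a paraphraser with preference-based optimization against a specific watermark, can be pictured with the minimal sketch below. This is an illustrative assumption rather than the released implementation: `detect_pvalue`, `quality`, and the DPO-style loss are placeholders standing in for whichever detector, quality metric, and preference objective the paper actually uses.

```python
# Hypothetical sketch: rank candidate paraphrases of a watermarked text by a
# watermark detector's p-value and a quality score, then apply a DPO-style
# preference loss so the paraphraser learns to favor rewrites that evade
# detection while preserving quality. All names here are illustrative.
import torch.nn.functional as F

def build_preference_pair(paraphrases, detect_pvalue, quality):
    """Pick a (chosen, rejected) pair from candidate paraphrases of one
    watermarked text: chosen evades detection (high p-value, good quality),
    rejected remains detectable."""
    scored = sorted(paraphrases, key=lambda t: (detect_pvalue(t), quality(t)))
    return scored[-1], scored[0]  # (chosen, rejected)

def preference_loss(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO-style objective over sequence log-probabilities: push the
    paraphraser toward chosen rewrites and away from rejected ones,
    regularized against a frozen reference model."""
    policy_ratio = policy_chosen_lp - policy_rejected_lp
    ref_ratio = ref_chosen_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (policy_ratio - ref_ratio)).mean()
```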
Lay Summary: Large Language Models (LLMs) like ChatGPT can produce realistic text that might be misused for spreading misinformation or spam. To address this issue, researchers use a technique called watermarking, which secretly embeds patterns into generated text, making it possible to detect and verify its origin. These watermarks are meant to be difficult to remove without significantly degrading the text quality. Our research demonstrates that current watermarking methods have a critical vulnerability: they are only tested against attackers who do not consider how the watermarking method works. Using knowledge of the watermarking algorithms, we developed a method to train optimized text rewriters from small, publicly available models. Surprisingly, we found that all current watermarking systems can be evaded with over a 96% success rate, even with limited computing resources (costing less than $10). Even more concerning, our attack methods worked effectively against watermarking systems they were not specifically designed for. These findings demonstrate the urgent need for more robust watermarking techniques that keep LLM-generated content reliably traceable.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/nilslukas/ada-wm-evasion
Primary Area: Social Aspects->Safety
Keywords: watermarking, language models, robustness, adaptive attacks
Submission Number: 7141