Keywords: LLM, watermark
Abstract: With the release of powerful open-source large language models (LLMs), post-training such models for downstream applications is becoming increasingly prevalent. To enable ownership claims and to track potential misuse of these models after post-training, planting detectable watermarks has become an essential task. In the open-source setting, users have complete white-box access and can freely alter the model's outputs, which renders some watermarking techniques, such as generation-time watermarks, ineffective. We therefore propose WindTalkers, a watermarking technique that is planted into the model's weights and remains robust against common post-training techniques such as reinforcement learning (RL) and supervised fine-tuning (SFT). We employ a cipher-like encoding to process the instructions in the training dataset; the encoding is designed to be recognizable only by the watermarked model, enabling a clear distinction between watermarked and non-watermarked models. Experimental results demonstrate that our method does not compromise the model's general performance and remains robust across various post-training procedures.
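The abstract describes the cipher-like instruction encoding only at a high level. Below is a minimal illustrative sketch of what such a keyed encoding could look like; the substitution cipher, the `SECRET_KEY` value, and all function names are assumptions for illustration, not the paper's actual construction.

```python
import random
import string

# Hypothetical sketch: a deterministic, key-derived substitution cipher
# standing in for the paper's unspecified "cipher-like encoding".

SECRET_KEY = 1234  # assumed watermark key; not from the paper


def build_cipher(key: int) -> dict:
    """Derive a deterministic character substitution table from the key."""
    rng = random.Random(key)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encode_instruction(text: str, key: int = SECRET_KEY) -> str:
    """Rewrite an instruction with the keyed cipher.

    Pairs of (encoded instruction -> normal response) would be mixed into
    the fine-tuning data so that only a model trained on them learns to
    follow encoded instructions.
    """
    table = build_cipher(key)
    return "".join(table.get(c, c) for c in text.lower())


print(encode_instruction("summarize the following article"))
```

Under this reading, detection amounts to querying a suspect model with encoded instructions: a watermarked model should respond coherently to them, while a clean model should not, yielding a behavioral test that survives white-box weight access and further post-training.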
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24361