Keywords: watermarking, LLM, open-source
TL;DR: We propose a simple, training-free method to watermark open-source LLMs by modifying their unembedding layer, enabling reliable detection with minimal impact on performance.
Abstract: With the growing prevalence of large language model (LLM)-generated content, watermarking is considered a promising approach for attributing text to LLMs and distinguishing it from human-written content. A common class of techniques embeds subtle but detectable signals in generated text by modifying token sampling probabilities. However, such methods are unsuitable for open-source models, where users have white-box access and can easily disable watermarking during inference. Existing watermarking methods that support open-source models often rely on complex or compute-intensive training procedures. In this work, we introduce OpenStamp, a simple, training-free watermarking technique that implants detectable signals into the generated text by modifying only the final projection, or unembedding, layer. Through experiments on two models, we show that OpenStamp achieves superior detection performance with minimal degradation in model capabilities. The implanted watermark signal is harder to remove through post-hoc fine-tuning than with previous methods, and it offers comparable robustness against paraphrasing attacks. We have shared our code through an anonymized repository to enable developers to easily watermark their models.
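For intuition, below is a minimal, hypothetical sketch of unembedding-layer watermarking: it adds a keyed pseudorandom perturbation to an LLM's output projection weights. The model name, secret key, perturbation strength, and the perturbation scheme itself are illustrative assumptions, not the actual OpenStamp construction described in the paper.

```python
# Hypothetical sketch only: a keyed pseudorandom perturbation of the unembedding
# (output projection) weights. This is NOT the paper's OpenStamp construction.
import torch
from transformers import AutoModelForCausalLM

SECRET_KEY = 42   # hypothetical watermark key
EPSILON = 1e-2    # hypothetical perturbation strength

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# GPT-2 ties the unembedding to the input embeddings; untie so that only the
# final projection layer is modified.
model.lm_head.weight = torch.nn.Parameter(model.lm_head.weight.detach().clone())

# Derive a fixed, key-dependent direction for each vocabulary token.
gen = torch.Generator().manual_seed(SECRET_KEY)
vocab_size, hidden_size = model.lm_head.weight.shape
signal = torch.randn(vocab_size, hidden_size, generator=gen)
signal = signal / signal.norm(dim=-1, keepdim=True)

# Training-free edit: add the keyed signal to the unembedding weights.
# A detector holding SECRET_KEY could then test generated text for the
# statistical bias this perturbation induces in token choices.
with torch.no_grad():
    model.lm_head.weight.add_(EPSILON * signal)
```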
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5236