PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs

ICLR 2026 Conference Submission 21379 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Watermark algorithms, Large Language Models, Open-source LLMs
Abstract: Text watermarking for large language models (LLMs) is important for model owners to verify the origin and protect the intellectual property of AI-generated text. While watermarking methods for closed-source LLMs' text generation are relatively mature, watermarking open-source LLMs' text generation remains challenging. Closed-source model developers typically embed text watermarks during decoding; however, this approach is ineffective for the text generation of open-source models, where developers have no control over how decoding occurs. As a result, owners of open-source LLMs still lack practical methods to verify whether a given piece of AI-generated text originated from their models. The primary challenge lies in embedding watermarks directly into model weights without compromising detection accuracy. One possible solution is first to create a text generation watermark in the closed-source setting, then distill that watermark information into the publicly released model's weights. However, this approach faces two critical challenges: (i) Reduced detectability due to inconsistency between the watermark patterns learned by the model and the predefined patterns used during detection. This inconsistency arises because existing closed-source watermark patterns are difficult for models to learn effectively. (ii) Vulnerability to modifications by downstream users, such as fine-tuning or model merging, which may weaken or completely remove the embedded watermark. To address these challenges, we propose ***PRO***, a precise and robust text watermarking method for open-source LLMs. First, we introduce a trainable watermark policy model, which is jointly optimized with the LLM during training. This co-optimization helps generate watermark patterns that are easier for the model to learn, significantly reducing inconsistencies between generated patterns and predefined detection criteria. Additionally, we incorporate a regularization term into the watermarking loss, which simulates various perturbations (e.g., fine-tuning, model merging) and penalizes any degradation in watermark detectability under these modifications. This approach ensures that the embedded watermark remains resilient even after downstream model alterations. Our evaluation on mainstream open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, and Phi-2) demonstrates that our approach significantly outperforms prior methods in terms of both watermark detectability and robustness against model modifications. The code is publicly available at https://anonymous.4open.science/r/PRO-DE2A/README.md.
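To make the two components of the abstract concrete, below is a minimal, illustrative sketch of one training step: (a) distilling a policy-generated watermark pattern into the LM's weights, (b) co-optimizing the watermark policy network alongside the LM, and (c) a robustness regularizer that simulates model merging and penalizes lost detectability. This is not the authors' released implementation (see the linked repository); `TinyLM`, `PolicyNet`, `detect_score`, the 70/30 merge ratio, and all loss weights are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

VOCAB, HIDDEN = 100, 32

class TinyLM(torch.nn.Module):
    """Stand-in for the open-source LLM whose weights will carry the watermark."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, HIDDEN)
        self.head = torch.nn.Linear(HIDDEN, VOCAB)
    def forward(self, x):                       # x: (batch, seq) -> (batch, seq, VOCAB)
        return self.head(self.emb(x))

class PolicyNet(torch.nn.Module):
    """Trainable watermark policy: maps context embeddings to a per-token logit bias."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(HIDDEN, HIDDEN), torch.nn.Tanh(),
            torch.nn.Linear(HIDDEN, VOCAB))
    def forward(self, ctx):                     # ctx: (batch, seq, HIDDEN)
        return self.net(ctx)

def detect_score(logits, bias):
    """Toy detectability proxy: probability mass on policy-favored ('green') tokens."""
    return (logits.softmax(-1) * torch.sigmoid(bias)).sum(-1).mean()

lm, policy = TinyLM(), PolicyNet()
other_ckpt = TinyLM()                           # frozen model used to simulate merging
opt = torch.optim.Adam(list(lm.parameters()) + list(policy.parameters()), lr=1e-3)
tokens = torch.randint(0, VOCAB, (4, 16))       # placeholder training batch

logits = lm(tokens)
bias = policy(lm.emb(tokens))                   # watermark pattern, conditioned on context

# (a) Embed the watermark: pull the LM toward its own policy-biased distribution.
target = (logits.detach() + bias).softmax(-1)
loss_embed = F.kl_div(logits.log_softmax(-1), target.detach(), reduction="batchmean")

# (b) Co-optimize the policy: its pattern should be detectable on the current LM,
#     while favoring only about half the vocabulary (so text quality is preserved).
loss_detect = -detect_score(logits, bias)
loss_balance = (torch.sigmoid(bias).mean() - 0.5) ** 2

# (c) Robustness regularizer: interpolate the LM's weights toward another checkpoint
#     (a simulated model merge) and penalize any drop in detectability; functional_call
#     keeps the gradient path through the LM's original parameters.
merged_params = {n: 0.7 * p + 0.3 * q.detach()
                 for (n, p), q in zip(lm.named_parameters(), other_ckpt.parameters())}
merged_logits = torch.func.functional_call(lm, merged_params, (tokens,))
loss_robust = -detect_score(merged_logits, bias.detach())

loss = loss_embed + 0.1 * loss_detect + loss_balance + 0.1 * loss_robust
opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch the perturbation is a single weight interpolation; the abstract's regularizer is described more generally as simulating various downstream modifications (e.g., fine-tuning as well as merging), so a faithful implementation would sample from a family of such perturbations.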
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21379