An End-to-End Model For Logits Based Large Language Models Watermarking

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM watermarking, End-to-end optimization, Robustness
TL;DR: We introduce the first logits-based end-to-end model for LLM watermarking, where encoder and decoder networks are jointly optimized to improve detection robustness and text quality.
Abstract: The rise of large language models (LLMs) has increased concerns over source tracing and copyright protection for AI-generated content (AIGC), highlighting the need for advanced detection technologies. Passive detection methods usually suffer from high false-positive rates, while active watermarking techniques that manipulate logits or sampling offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and can introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, largely because the encoder and decoder are not optimized end to end. In this paper, we introduce the first end-to-end logits perturbation method for watermarking LLM-generated text. By jointly optimizing the encoder and decoder, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method demonstrates superior detection robustness, consistently outperforming state-of-the-art (SOTA) methods by 1.2\%, 4.0\%, and 5.5\% across 3 LLMs, averaged over 6 types of text distortions. Simultaneously, our approach achieves exceptional text quality, as evidenced by reduced perplexity and improved performance in downstream tasks, by margins of 19.2\% and 3.03\%. Our method generalizes easily to different LLMs. The code is available in the supplementary material.
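The paper's encoder and decoder are jointly trained networks, which this page does not detail. As background, the general logits-perturbation idea the abstract builds on can be sketched with a fixed, hash-based rule in the style of green-list watermarking: bias a pseudo-random subset of the vocabulary at each step, then detect via the fraction of "green" tokens. This is an illustrative sketch only, not the authors' learned method; all names and parameter values (`gamma`, `delta`) are assumptions.

```python
import hashlib
import math
import random

def green_ids(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudo-randomly select a 'green' subset of the vocabulary, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def perturb_logits(logits, prev_token, delta=2.0, gamma=0.5):
    """Encoder stand-in: add a bias delta to the logits of green tokens."""
    greens = green_ids(prev_token, len(logits), gamma)
    return [l + delta if i in greens else l for i, l in enumerate(logits)]

def detect(tokens, vocab_size, gamma=0.5):
    """Decoder stand-in: z-score of the observed green-token rate against the null rate gamma."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_ids(prev, vocab_size, gamma))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
```

A learned end-to-end system replaces the fixed hash with an encoder network producing the perturbation and a decoder network producing the detection score, trained jointly so that quality loss and detection loss are balanced; the hash-based rule above has no such tradeoff knob beyond `delta`.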
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9685