Track: long paper (up to 9 pages)
Keywords: LLM Watermark
Abstract: We study how to watermark LLM outputs, i.e., how to embed algorithmically detectable signals into LLM-generated text to track misuse. Unlike current mainstream methods, which work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to recognize the watermarked text generated by the LLM and (2) tunes the LLM to generate text that the detector can easily identify while preserving its normal utility. We empirically show that our watermarks are more accurate, more robust, and more adaptable (to new attacks), with no generation overhead. They also allow the watermarked model to be open-sourced. In addition, when used together with alignment, the extra overhead introduced is low: we only need to train an extra reward model (i.e., our detector). We hope our work encourages broader study of watermark designs that are not limited to LLMs with unchanged model weights.
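For intuition, here is a minimal sketch of the alternating procedure described in the abstract. The object methods (`generate`, `fit`, `score`, `rl_tune`) and the KL coefficient are hypothetical placeholders for illustration, not the paper's actual API or hyperparameters.

```python
# Illustrative sketch of the iterative detector/LLM co-training loop.
# All method names below are hypothetical placeholders, not the authors' implementation.

def cotrain(llm, detector, human_corpus, prompts, rounds=3):
    """Alternate between (1) fitting the detector and (2) RL-tuning the LLM."""
    for _ in range(rounds):
        # Step 1: train the detector to separate watermarked text from human text.
        watermarked = [llm.generate(p) for p in prompts]
        detector.fit(positives=watermarked, negatives=human_corpus)

        # Step 2: tune the LLM so its outputs score highly under the detector.
        # A KL penalty to a frozen reference model (assumed here) preserves utility,
        # mirroring standard RLHF-style training where the detector acts as the reward model.
        llm.rl_tune(prompts, reward_fn=detector.score, kl_coef=0.1)
    return llm, detector
```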
Presenter: ~Xiaojun_Xu1
Format: No, the presenting author is unable to, or unlikely to be able to, attend in person.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 7