Keywords: Watermarking, Language Models, On-Policy Learning
TL;DR: We propose a practical and theoretically grounded on-policy fine-tuning framework that advances the quality-detectability Pareto frontier of open-weight LM watermarking.
Abstract: Open-weight language models pose acute challenges for watermarking because inference-time interventions cannot be enforced once model weights are public. Existing methods, such as the recently proposed **GaussMark**, typically embed the watermark by subtly modifying model weights. While such schemes demonstrate that imperceptible perturbations can yield detectable signals, they require computationally intensive parameter searches and achieve only limited progress along the quality-detectability frontier. We introduce **MarkTune**, a theoretically principled on-policy fine-tuning framework that treats watermark detectability as a reward signal while regularizing against degradation in text quality. We instantiate our approach with **GaussMark** as the base watermarking scheme, adapting the non-watermarked weights to preserve generation quality. Empirically, **MarkTune** consistently advances the quality-detectability Pareto frontier over vanilla **GaussMark**: it improves true positive rates at fixed false positive thresholds, restores perplexity and benchmark accuracy to near-unwatermarked levels, and remains robust under paraphrasing and translation attacks. Together, these results establish on-policy fine-tuning as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
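To make the abstract's description of the training objective concrete, the following is a minimal sketch of an on-policy update that rewards watermark detectability while penalizing drift from a reference policy. All names (`marktune_loss`, the `detect_stat` reward, the `beta` coefficient) are hypothetical illustrations, not the paper's actual algorithm or hyperparameters.

```python
# Hypothetical sketch of a detectability-rewarded, quality-regularized
# on-policy objective, in the spirit of the abstract's description.
import torch

def marktune_loss(logprobs_new, logprobs_ref, detect_stat, beta=0.1):
    """REINFORCE-style surrogate loss.

    logprobs_new: (batch,) summed token log-probs of sampled completions
                  under the model being fine-tuned.
    logprobs_ref: (batch,) the same completions scored by a frozen
                  reference (unwatermarked) model.
    detect_stat:  (batch,) a watermark detection statistic per completion
                  (e.g., a GaussMark-style test statistic), used as reward.
    beta:         weight of the quality-drift penalty.
    """
    # Per-sample proxy for KL divergence from the reference policy.
    kl_proxy = logprobs_new - logprobs_ref
    # Regularized reward: detectability minus quality-drift penalty.
    reward = detect_stat - beta * kl_proxy.detach()
    # Policy-gradient surrogate: maximize reward => minimize negative.
    return -(reward * logprobs_new).mean()

# Toy usage with random tensors standing in for real rollouts.
lp_new = torch.randn(4, requires_grad=True)
lp_ref = torch.randn(4)
stat = torch.randn(4)
loss = marktune_loss(lp_new, lp_ref, stat)
loss.backward()
```

The detachment of the KL proxy inside the reward keeps the gradient flowing only through the standard policy-gradient term; other regularization choices are equally plausible under the abstract's description.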
Submission Number: 15