Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

TMLR Paper 6380 Authors

04 Nov 2025 (modified: 13 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Watermarking is emerging as a practical mechanism for provenance in language models, but it modifies token probabilities at inference time, the same locus targeted by alignment training. This overlap raises a basic deployment question: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? We conduct a systematic study across several contemporary models and two representative watermarking schemes. We find that watermarking induces a nontrivial shift in alignment that is patterned yet model-specific. Two regimes recur: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. Crucially, these effects persist even after controlling for perplexity degradation, indicating alignment-specific distortions beyond generalized quality loss. To mitigate them, we introduce Alignment Resampling (AR), an inference-time procedure that samples multiple watermarked outputs and selects the most aligned response according to an external reward model. Drawing on established results for the expected maximum of Gaussian random variables, we derive a theoretical lower bound showing that alignment gains grow sublogarithmically with sample size, which gives principled guidance on minimal sampling requirements. Empirically, sampling as few as two to four candidates largely restores unwatermarked alignment performance on truthfulness, safety, and helpfulness, while leaving watermark detectability essentially unchanged. This study offers the first systematic audit of watermarking-alignment interactions, quantifies the trade-off between watermark strength and alignment, and proposes a simple inference-time mitigation suitable for deployment.
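
As described in the abstract, AR is best-of-n selection over watermarked generations under an external reward model. Below is a minimal sketch under that reading; `generate_watermarked` and `reward_model` are hypothetical stand-ins for a watermarked decoder and an alignment reward model, not the paper's actual interfaces:

```python
# Minimal sketch of Alignment Resampling (AR) as described in the abstract:
# draw n watermarked candidates, return the one the reward model rates highest.
# `generate_watermarked` and `reward_model` are hypothetical stand-ins.

def alignment_resampling(prompt, generate_watermarked, reward_model, n=4):
    """Best-of-n selection over watermarked generations.

    Every candidate is produced by the watermarked decoder, so the
    selected output remains watermarked; only alignment is re-ranked.
    """
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    # Return the candidate with the highest alignment reward.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]

# Example (hypothetical interfaces):
# best = alignment_resampling("Explain X safely.", my_decoder, my_rm, n=4)
```

Because every candidate is itself watermarked, selection re-ranks alignment without touching the watermark signal, consistent with the abstract's claim that detectability is essentially unchanged; the reported results suggest n = 2 to 4 suffices.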
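The abstract's sampling bound rests on the classical expected-maximum result for Gaussians. A sketch of the underlying statement, under the assumption (ours, for illustration, not necessarily the paper's exact setup) that per-candidate reward scores behave like i.i.d. Gaussians:

```latex
% Assumption (illustrative): per-candidate alignment rewards
% R_1, ..., R_n are approximately i.i.d. N(mu, sigma^2). The classical
% bounds on the expected maximum then read, for n >= 2 and an absolute
% constant c > 0:
\[
  \mu + c\,\sigma\sqrt{\ln n}
  \;\le\;
  \mathbb{E}\Big[\max_{1 \le i \le n} R_i\Big]
  \;\le\;
  \mu + \sigma\sqrt{2\ln n}.
\]
% The gain over a single sample thus grows like sqrt(ln n),
% i.e., sublogarithmically in the sample size n.
```

Since the gain flattens quickly in n, most of the achievable improvement is realized at small sample counts, consistent with the finding that two to four candidates largely restore unwatermarked alignment.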
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhouxing_Shi1
Submission Number: 6380