DynamicBias: Sequence-Aware Calibrated Watermarking for Large Language Models

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: watermarking, robustness, safety
Abstract: Text watermarking has attracted significant research interest as a way to mitigate LLM-related harms by enabling reliable identification of machine-generated text. In particular, "green" and "red" vocabulary-partition watermarking, which uses a static bias to skew token sampling toward green tokens and away from red ones, is a promising approach. However, a persistent trade-off remains: stronger watermarks improve detectability but can harm quality, while weaker ones preserve quality but are harder to detect and easier to remove via paraphrasing. A key reason is that a static bias ignores heterogeneous logit distributions across models, domains, and languages, yielding inconsistent performance and hindering practical deployment. Our preliminary investigation shows substantial variability in these distributions and the associated performance disparities, driven by model certainty, measured as the margin between the top logit and the average of a small pool of next-best token logits. Building on this observation, we propose DynamicBias, which calibrates the bias at each step using a sequence-level average of this margin with a single scaling parameter $\alpha$. Theoretically, we show that DynamicBias admits a unique optimal $\alpha$ and that expected detectability increases as the sequence-level margin grows. This calibration yields consistent detectability across models and integrates directly with existing vocabulary-partition watermarks, offering a practical solution for real-world deployment. Extensive experiments across four LLMs and three languages demonstrate improved detection with competitive text quality and stronger robustness to paraphrasing.
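The abstract's calibration idea can be sketched as follows. This is a minimal, hypothetical illustration only: the function name `dynamic_bias_step`, the `pool_size` of next-best logits, and the running-average bookkeeping are assumptions, since the page gives the mechanism in prose but no reference implementation.

```python
import numpy as np

def dynamic_bias_step(logits, green_mask, margin_history, alpha=1.0, pool_size=5):
    """One decoding step of a DynamicBias-style calibration (illustrative sketch).

    The per-step certainty margin is the gap between the top logit and the
    mean of the next `pool_size` logits. The green-token bias is alpha times
    the running (sequence-level) average of this margin, so confident models
    receive a proportionally larger bias than uncertain ones.
    """
    sorted_logits = np.sort(logits)[::-1]                      # descending
    margin = sorted_logits[0] - sorted_logits[1:1 + pool_size].mean()
    margin_history.append(margin)                              # sequence-level state
    bias = alpha * (sum(margin_history) / len(margin_history)) # running average
    biased = np.where(green_mask, logits + bias, logits)       # boost green tokens only
    return biased, bias
```

In a static-bias watermark, `bias` would be a fixed constant; here it adapts to the model's certainty over the sequence, which is the calibration the abstract attributes to DynamicBias.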
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24424