everyone
since 04 Oct 2024">EveryoneRevisionsBibTeXCC BY 4.0
Statistical watermarking offers a theoretically-sound method for distinguishing machine-generated texts. In this work, we first present a systematic theoretical analysis of the statistical limits of watermarking, by framing it as a hypothesis testing problem. We derive nearly matching upper and lower bounds for (i) the optimal Type II error under a fixed Type I error, and (ii) the minimum number of tokens required to watermark the output. Our rate of $\Theta(h^{-1} \log (1/h))$ for the minimum number of required tokens, where $h$ is the average entropy per token, reveals a significant gap between the statistical limit and the $O(h^{-2})$ rate achieved in prior works. To our knowledge, this is the first comprehensive statistical analysis of the watermarking problem. Building on our theory, we develop SEAL (Semantic-awarE speculAtive sampLing), a novel watermarking algorithm for practical applications. SEAL introduces two key techniques: (i) designing semantic-aware random seeds by leveraging a proposal language model, and (ii) constructing a maximal coupling between the random seed and the next token through speculative sampling. Experiments on open-source benchmarks demonstrate that our watermarking scheme delivers superior efficiency and tamper resistance, particularly in the face of paraphrase attacks.