Edit Distance Robust Watermarks via Indexing Pseudorandom Codes

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · CC BY 4.0
Keywords: Watermarking, pseudorandom codes, language models
TL;DR: We introduce the first undetectable watermarking schemes for language models with robustness to adversarial substitutions, insertions, and deletions.
Abstract: Motivated by the problem of detecting AI-generated text, we consider watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn, & Zamir (2023) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions; robustness with respect to edit distance is thus a more natural and appealing guarantee. Our main result is a watermarking scheme which achieves both (a) and (b) when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds by first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our codes have the additional benefit of relying on weaker computational assumptions than those used in previous work. We then show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.
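For intuition only, the sketch below illustrates the high-level pipeline the abstract describes: embed a codeword one symbol per token by biasing sampling toward tokens whose keyed hash bit matches, and detect by counting agreements. This is emphatically not the paper's construction: it is neither undetectable nor robust to edits, and all names here (`bit_of`, `embed`, `detect`, `KEY`, `VOCAB`) are hypothetical stand-ins.

```python
# Toy illustration of codeword-biased token sampling (NOT the paper's scheme).
# A shared key pseudorandomly partitions the vocabulary into "0" and "1"
# tokens; generation samples from the half matching the next codeword bit,
# and detection measures the fraction of positions that agree.

import hashlib
import random

KEY = "secret-key"                        # shared watermarking key (assumption)
VOCAB = [f"tok{i}" for i in range(64)]    # stand-in vocabulary

def bit_of(token: str, key: str) -> int:
    """Keyed pseudorandom bit assigned to each token."""
    return hashlib.sha256(f"{key}|{token}".encode()).digest()[0] & 1

def embed(codeword: list[int], length: int, key: str) -> list[str]:
    """Sample each token from the half of the vocabulary whose hash bit
    equals the current codeword bit (cycling through the codeword)."""
    out = []
    for i in range(length):
        b = codeword[i % len(codeword)]
        choices = [t for t in VOCAB if bit_of(t, key) == b]
        out.append(random.choice(choices))
    return out

def detect(tokens: list[str], codeword: list[int], key: str) -> float:
    """Fraction of positions whose hash bit agrees with the codeword:
    about 0.5 for unwatermarked text, near 1.0 for watermarked text."""
    hits = sum(bit_of(t, key) == codeword[i % len(codeword)]
               for i, t in enumerate(tokens))
    return hits / len(tokens)

if __name__ == "__main__":
    cw = [random.getrandbits(1) for _ in range(16)]
    watermarked = embed(cw, 200, KEY)
    plain = [random.choice(VOCAB) for _ in range(200)]
    print("watermarked agreement:  ", detect(watermarked, cw, KEY))  # ~1.0
    print("unwatermarked agreement:", detect(plain, cw, KEY))        # ~0.5
```

In the paper's actual scheme, the embedded string is a codeword of a pseudorandom code whose decoder tolerates a constant fraction of adversarial insertions, substitutions, and deletions; it is that code, not naive bit-matching, which yields the undetectability and edit-distance robustness guarantees.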
Primary Area: Learning theory
Submission Number: 15071