Baselines for Identifying Watermarked Large Language Models

Published: 20 Jun 2023, Last Modified: 07 Aug 2023, AdvML-Frontiers 2023
Keywords: Large language models, watermarks, security, cryptography
TL;DR: We introduce theory and a suite of three algorithms for determining the presence of watermarks in large language models
Abstract: We consider the emerging problem of identifying the presence of watermarking schemes in publicly hosted, closed-source large language models (LLMs). Rather than determining whether a given text was generated by a watermarked language model, we seek to answer the question of whether the model itself is watermarked. We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing the distributions of output tokens and logits produced by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce token distributions that diverge qualitatively and identifiably from those of standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to the watermarking scenario.
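The distribution-comparison idea in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm: it assumes a hypothetical "green list" watermark (a fixed logit boost on a subset of the vocabulary, in the style of common LLM watermarking schemes) and uses a smoothed KL divergence between empirical token-frequency distributions as the divergence statistic. All names, parameters, and thresholds here are illustrative assumptions.

```python
import math
import random
from collections import Counter

def sample_tokens(logits, n, rng):
    """Draw n token ids from the softmax distribution over the vocabulary."""
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=n)

def kl_divergence(counts_p, counts_q, vocab_size, alpha=1.0):
    """KL(P || Q) between two empirical token distributions,
    with additive (Laplace) smoothing so unseen tokens don't blow up."""
    n_p = sum(counts_p.values()) + alpha * vocab_size
    n_q = sum(counts_q.values()) + alpha * vocab_size
    kl = 0.0
    for t in range(vocab_size):
        p = (counts_p.get(t, 0) + alpha) / n_p
        q = (counts_q.get(t, 0) + alpha) / n_q
        kl += p * math.log(p / q)
    return kl

rng = random.Random(0)
vocab = 50  # toy vocabulary size (real LLM vocabularies are ~10^4-10^5)
base_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]

# Hypothetical watermark: boost the logits of a random half of the vocabulary.
green = set(rng.sample(range(vocab), vocab // 2))
wm_logits = [l + (2.0 if t in green else 0.0) for t, l in enumerate(base_logits)]

ref = Counter(sample_tokens(base_logits, 20000, rng))       # reference (unmarked) sample
unmarked = Counter(sample_tokens(base_logits, 20000, rng))  # second unmarked sample
marked = Counter(sample_tokens(wm_logits, 20000, rng))      # watermarked sample

kl_unmarked = kl_divergence(unmarked, ref, vocab)
kl_marked = kl_divergence(marked, ref, vocab)
print(f"KL(unmarked || ref) = {kl_unmarked:.4f}")  # near zero: same distribution
print(f"KL(marked   || ref) = {kl_marked:.4f}")    # clearly larger: divergent distribution
```

The point of the sketch is only the qualitative gap: two samples from the same unmarked model yield near-zero divergence, while the watermark's logit perturbation produces a token distribution that diverges identifiably from the reference, which is the signal the paper's baseline identification algorithms exploit.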
Supplementary Material: zip
Submission Number: 83