Provably Robust Watermarks for Open-Source Language Models

TMLR Paper7736 Authors

02 Mar 2026 (modified: 30 May 2026) | Under review for TMLR | CC BY 4.0

Abstract: Watermarking is a leading solution for the increasingly pressing problem of identifying AI-generated text. Existing large language model (LLM) watermarking approaches crucially rely on the LLM's source code and parameters being secret, which makes them ineffective in an open-source setting. In this work, we introduce the first watermarking scheme for open-source LLMs with provable robustness guarantees. Under precisely defined assumptions about the adversary's knowledge, we prove that the adversary either fails to remove the watermark or significantly degrades the quality of the model. We supplement our theoretical results with experiments using Qwen, which show how our proven robustness-quality tradeoff manifests in practice. Our main contribution is showing the feasibility of watermarks with provable guarantees in the open-source setting. We provide the first formal definition of robustness in this setting and show that it is achievable by a fairly simple scheme. Although the scheme itself is simple, the bulk of our work lies in modeling the problem in a way that is realistic yet amenable to provable results, and in analyzing our scheme to prove robustness. We hope that our definitions and the techniques used in our analysis pave the way for future work on open-source watermarks.

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We re-ran all experiments using Qwen and updated the paper accordingly.

Assigned Action Editor: ~Alessandro_De_Palma1

Submission Number: 7736
