SPO: A Black-box, Unbiased, Robust Watermarking Method for Large Language Model

ICLR 2026 Conference Submission10889 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Watermark, Large Language Model, Unbiased Watermark, Black-box Watermark
Abstract: Large language models (LLMs) have had a revolutionary impact on text generation. Despite their widespread application, LLMs raise significant ethical and security concerns about potential misuse, such as fake news and malicious content. Watermarking is a crucial means of identifying generated content and thereby mitigating misuse. Existing watermarking methods have their respective strengths and weaknesses, but achieving a balance among black-box embedding, unbiased output, and robustness remains a challenge. To address this limitation, we propose a novel black-box watermarking method called the Sampling and Prioritizing Output (SPO) method. By prioritizing the allocation of watermarked tokens over non-watermarked tokens, SPO maximizes the number of watermarked tokens within the designated watermarked subspace. The method then randomly samples an output token from this subspace to embed the watermark. As a black-box approach, SPO does not rely on detailed model parameters for watermark embedding and effectively safeguards the intellectual property of LLMs. Extensive experimental results and theoretical analysis show that SPO is unbiased, embedding the watermark without compromising the quality of generated content. Furthermore, it exhibits superior detectability and robustness compared to existing unbiased watermarking methods. These results demonstrate clear advantages over current unbiased methodologies and offer a solution better suited to real-world scenarios.
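The abstract describes embedding by restricting sampling to a designated watermarked subspace of the vocabulary. As a rough illustration of that general idea (not the paper's actual algorithm, whose construction is not specified here), the following sketch partitions the vocabulary pseudo-randomly based on the preceding token and prefers candidates that fall inside the subspace; the partition rule, subspace fraction, and fallback behavior are all hypothetical:

```python
import hashlib
import random

def watermark_subspace(prev_token: str, vocab: list, fraction: float = 0.5) -> set:
    """Seed a PRNG from the previous token and select a pseudo-random
    'watermarked' subspace of the vocabulary. Hypothetical partition rule,
    for illustration only."""
    seed = int.from_bytes(hashlib.sha256(prev_token.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    k = max(1, int(len(vocab) * fraction))
    return set(rng.sample(vocab, k))

def sample_token(prev_token: str, candidates: list, vocab: list) -> str:
    """Prefer candidate tokens inside the watermarked subspace; fall back
    to all candidates if none land in the subspace (illustrative only)."""
    subspace = watermark_subspace(prev_token, vocab)
    preferred = [t for t in candidates if t in subspace]
    pool = preferred if preferred else candidates
    return random.choice(pool)
```

A detector with the same seeding rule can recompute the subspace for each position and test whether watermarked tokens appear more often than chance would predict.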
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10889