Semantic-Oriented Robust Text Watermark for Large Language Models

ACL ARR 2024 December Submission 344 Authors

13 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Text watermarking focuses on injecting identifiable information into generated content, which has become increasingly important with the rapid development of Large Language Models (LLMs). Existing watermarking works either divide the vocabulary of LLMs into "green" and "red" tokens for watermark generation (i.e., token-level watermarks), or use the distance between generated sentence embeddings to distinguish "green" and "red" partitions (i.e., sentence-level watermarks). Despite this progress, existing methods remain vulnerable to attacks and to Out-Of-Distribution (OOD) generalization. To this end, we focus on sentence-level watermarking and propose a novel Semantic-oriented Robust Text Watermark for LLMs (SoTW). Specifically, we first employ a pre-trained embedding model to obtain representations of generated sentences. Then, unlike existing sentence-level works, we design a novel Semantic Quantization AutoEncoder (SQAE) to generate discrete representations for the partitions. Moreover, a semantic loss and a consistency loss are developed to ensure the generalization and robustness of the generated watermarks. Furthermore, we develop an easy-to-use detection method for SoTW. Extensive experiments with two LLMs on two publicly available datasets demonstrate the robustness of SoTW under different attack methods and OOD settings. In addition, we release our code to facilitate the community.
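
The abstract describes the pipeline only at a high level. As a rough, hypothetical illustration of how a sentence-level watermark built on quantized sentence embeddings might be generated and detected, the sketch below substitutes a toy hash-based embedding and a random codebook for the pre-trained embedder and the learned SQAE; the function names, codebook size, and thresholds are placeholder assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): a sentence-level watermark where each
# sentence embedding is quantized to a discrete code, and codes are split into
# "green" and "red" partitions. All components below are simplified stand-ins.
import hashlib
import numpy as np

DIM, NUM_CODES = 64, 32
rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(NUM_CODES, DIM))  # stands in for learned SQAE codebook vectors
GREEN_CODES = set(rng.choice(NUM_CODES, NUM_CODES // 2, replace=False).tolist())

def embed(sentence: str) -> np.ndarray:
    """Toy stand-in for a pre-trained sentence embedding model."""
    seed = int(hashlib.sha256(sentence.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=DIM)

def quantize(vec: np.ndarray) -> int:
    """Nearest-codebook assignment, mimicking the discrete code an SQAE would emit."""
    dists = np.linalg.norm(CODEBOOK - vec, axis=1)
    return int(np.argmin(dists))

def is_green(sentence: str) -> bool:
    """A watermarking generator would keep only sentences that fall in green codes."""
    return quantize(embed(sentence)) in GREEN_CODES

def detect(sentences: list[str]) -> float:
    """z-score of the observed green fraction vs. the 50% expected without a watermark."""
    hits = sum(is_green(s) for s in sentences)
    n = len(sentences)
    return (hits - 0.5 * n) / np.sqrt(0.25 * n)

if __name__ == "__main__":
    text = ["The model generates one sentence.", "Then it generates another one."]
    print(detect(text))  # large positive values would suggest a watermark is present
```

Under this toy setup, a generator would resample or steer decoding until each sentence maps to a green code, so that detection reduces to a simple hypothesis test on the green-code rate; the paper's SQAE, semantic loss, and consistency loss presumably make this partition stable under paraphrasing attacks and OOD inputs, which this sketch does not attempt to model.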
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: misinformation detection; security
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 344