Abstract: Although large language models (LLMs) have achieved unprecedented generation capabilities, they also introduce risks of misuse, including disinformation, academic dishonesty, and copyright violation. LLM watermarking has emerged as a promising solution for text attribution. However, as watermark robustness increases, a critical vulnerability emerges: spoofing attacks, in which attackers exploit robust watermarks to maliciously alter content while preserving the watermark's detectability, potentially damaging the LLM owner's reputation. We introduce a checksum-based dual verification approach that can be deployed on SOTA watermark algorithms with minimal interference. Our approach preserves text quality while providing both coarse-grained watermark detection and fine-grained verification of content integrity, enabling rapid checks of the generated text's integrity and localization of edit positions. Compared with Bileve, in terms of watermark detectability after a One-Token Attack, our method applied to the host watermark improves the F1-score from 0.730 to 0.916; it also improves the F1-score for integrity detection, and the perplexity of the watermarked text it generates is close to that of natural text.
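The paper's method embeds checksums during watermarked generation, which is not reproduced here. As a purely illustrative sketch of the general idea behind checksum-based integrity localization, the toy below computes per-block hashes of a text and compares them after tampering to pinpoint which block was edited; the block size, hash choice, and function names are all assumptions, not the paper's algorithm.

```python
import hashlib

def block_checksums(text, block_size=16):
    """Split text into fixed-size blocks and compute a short hash per block.
    (Toy illustration only; the paper embeds checksums during generation.)"""
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    return [hashlib.sha256(b.encode()).hexdigest()[:8] for b in blocks]

def locate_edits(original_sums, tampered_text, block_size=16):
    """Return indices of blocks whose checksum no longer matches."""
    new_sums = block_checksums(tampered_text, block_size)
    return [i for i, (a, b) in enumerate(zip(original_sums, new_sums)) if a != b]

original = "the quick brown fox jumps over the lazy dog"
sums = block_checksums(original)
tampered = original.replace("lazy", "hazy")  # a one-token edit
print(locate_edits(sums, tampered))  # → [2]: only the third block was altered
```

A real scheme would tie the checksum to the watermark key and token sequence so that an attacker cannot recompute it after editing; this sketch only shows why per-segment checksums localize edits rather than merely detecting them.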
DOI: 10.1007/978-981-95-4381-6_24