Abstract: Watermark algorithms for large language models (LLMs) have achieved extremely
high accuracy in detecting text generated by LLMs. Such algorithms typically
involve adding extra watermark logits to the LLM’s logits at each generation
step. However, prior algorithms face a trade-off between attack robustness and
security robustness. This is because the watermark logits for a token are determined
by a certain number of preceding tokens; a small number leads to low security
robustness, while a large number results in insufficient attack robustness. In
this work, we propose a semantic invariant watermarking method for LLMs that
provides both attack robustness and security robustness. The watermark logits in
our work are determined by the semantics of all preceding tokens. Specifically, we
utilize another embedding LLM to generate semantic embeddings for all preceding
tokens, and then these semantic embeddings are transformed into the watermark
logits through our trained watermark model. Subsequent analyses and experiments
demonstrate the attack robustness of our method under semantically invariant
perturbations, namely synonym substitution and text paraphrasing. Finally, we also show that our
watermark possesses adequate security robustness. Our code and data are available
at https://github.com/THU-BPM/Robust_Watermark. Additionally, our algorithm
can also be accessed through MarkLLM (Pan et al., 2024).
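
As a rough illustration of the mechanism sketched in the abstract, the following Python snippet shows how semantics-conditioned watermark logits could be added to an LLM's next-token logits. The `WatermarkModel` architecture, the `embedder` interface, and the `delta` scaling parameter are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Minimal sketch: watermark logits conditioned on the semantics of all
# preceding tokens. All components below are illustrative stand-ins.
import torch
import torch.nn as nn


class WatermarkModel(nn.Module):
    """Hypothetical trained network that maps a semantic embedding of the
    preceding tokens to one watermark logit per vocabulary entry."""

    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vocab_size),
            nn.Tanh(),  # keep the watermark logits bounded
        )

    def forward(self, semantic_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(semantic_embedding)


def watermarked_logits(lm_logits, prefix_text, embedder, watermark_model, delta=1.0):
    """Add semantics-conditioned watermark logits to the LM's next-token logits.

    Because the embedding depends on the meaning of the whole prefix rather
    than a fixed window of tokens, synonym substitution or paraphrasing
    leaves it (and hence the watermark logits) nearly unchanged.
    """
    emb = embedder(prefix_text)          # semantic embedding of the prefix
    w_logits = watermark_model(emb)      # one bias per vocabulary token
    return lm_logits + delta * w_logits  # delta scales watermark strength


# Usage with stand-in components (a random embedder, for illustration only):
vocab_size, embed_dim = 32000, 768
wm = WatermarkModel(embed_dim, vocab_size)
embedder = lambda text: torch.randn(embed_dim)  # stand-in for an embedding LLM
lm_logits = torch.randn(vocab_size)
biased = watermarked_logits(lm_logits, "some generated prefix", embedder, wm)
```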