TL;DR: We propose embedding multi-bit information into language model outputs to tackle malicious misuse of large language models.
Abstract: We show the viability of tackling misuse of large language models beyond merely identifying machine-generated text. While existing methods focus on detection alone, some malicious misuses demand tracing the adversarial user in order to counteract them. To address this, we propose Multi-bit Watermark via Position Allocation, which embeds traceable multi-bit information during language model generation. Building on the benefits of zero-bit watermarking (Kirchenbauer et al., 2023a), our method enables robust extraction of the watermark without any model access, embeds and extracts long messages ($\geq$ 32 bits) without finetuning, and maintains text quality, all while still allowing zero-bit detection. Moreover, our watermark remains relatively robust under strong attacks such as interleaving human text and paraphrasing. We compare against existing work to show the effectiveness of our scheme in terms of robustness and latency.
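To make the position-allocation idea concrete, here is a minimal Python sketch of one watermarked decoding step, not the authors' implementation: the context pseudo-randomly selects which bit position of the message the step encodes, and that bit selects which half of a random vocabulary partition receives a logit boost, generalizing the green/red lists of zero-bit watermarking (Kirchenbauer et al., 2023a). The function names (`bias_logits`, `_seed_from_context`), the binary (two-list) partition, and the toy parameters are all hypothetical simplifications.

```python
import hashlib
import random

def _seed_from_context(prev_token_id: int) -> int:
    # Derive a pseudo-random seed from the preceding token, as in
    # zero-bit watermarking schemes that seed the list partition on context.
    digest = hashlib.sha256(str(prev_token_id).encode()).hexdigest()
    return int(digest, 16) % (2**31)

def bias_logits(logits, prev_token_id, message_bits, vocab_size, delta=2.0):
    """Bias one generation step so it carries one bit of the message.

    Position allocation: the context pseudo-randomly picks which bit
    position of the message this step encodes; the bit value then picks
    which half of a random vocabulary partition gets the logit boost.
    (Hypothetical sketch; real implementations operate on logit tensors
    and support multi-symbol alphabets, not just two lists.)
    """
    rng = random.Random(_seed_from_context(prev_token_id))
    pos = rng.randrange(len(message_bits))      # allocate a message position
    perm = list(range(vocab_size))
    rng.shuffle(perm)                           # random vocabulary partition
    half = vocab_size // 2
    chosen = perm[:half] if message_bits[pos] == 0 else perm[half:]
    for tok in chosen:                          # favor the selected list
        logits[tok] += delta
    return logits

# Toy usage: bias a dummy 8-token logit vector to carry one bit of 1011.
logits = [0.0] * 8
biased = bias_logits(logits, prev_token_id=42,
                     message_bits=[1, 0, 1, 1], vocab_size=8)
```

Extraction would reverse this process: using the same context-derived seed, each observed token is re-assigned to a message position and a list, and a majority vote over the tokens allocated to each position recovers that bit, which is why no model access is needed.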
Paper Type: long
Research Area: NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English