Bypassing LLM Watermarks with Color-Aware Substitutions

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Watermarking approaches have been proposed to determine whether circulating text was written by a human or generated by a large language model (LLM). The state-of-the-art strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific “green” tokens. However, the robustness of this watermarking method remains an open problem. Existing attack methods do not use color information (whether a token is green or not) and may fail to evade detection on longer texts. We propose Self Color Testing-based Substitution (SCTS), the first “color-aware” attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output frequencies, which allows it to determine token colors. It then substitutes green tokens with red ones. In our experiments, SCTS evades watermark detection with fewer edits than related work. We further show, both theoretically and empirically, that SCTS can remove the watermark from arbitrarily long watermarked text.
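The color-aware substitution idea in the abstract can be sketched in a toy model. This is not the authors' implementation: it assumes a simplified hash-based green-list rule in the style of Kirchenbauer et al. (2023a), a 100-token dummy vocabulary, and hypothetical helper names (`is_green`, `z_score`, `color_aware_substitute`); it also ignores fluency, which the real SCTS attack preserves via synonym choices.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" at each step
VOCAB = [f"tok{i}" for i in range(100)]  # toy vocabulary, purely illustrative

def is_green(prev_token: str, token: str) -> bool:
    """Toy Kirchenbauer-style rule: hash the previous token to seed a
    permutation of the vocabulary; the first GAMMA fraction is green."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    perm = VOCAB[:]
    rng.shuffle(perm)
    return token in set(perm[: int(GAMMA * len(perm))])

def z_score(tokens: list[str]) -> float:
    """Detector statistic: observed green count vs. its expectation
    under the null hypothesis of unwatermarked text."""
    t = len(tokens) - 1
    g = sum(is_green(tokens[i - 1], tokens[i]) for i in range(1, len(tokens)))
    return (g - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

def color_aware_substitute(tokens: list[str]) -> list[str]:
    """Replace every green token with a red one. Substitutions change the
    context for the next position, so colors are re-checked against the
    already-rewritten prefix."""
    out = [tokens[0]]
    for tok in tokens[1:]:
        if is_green(out[-1], tok):
            # pick any token that is red in the current context
            tok = next(t for t in VOCAB if not is_green(out[-1], t))
        out.append(tok)
    return out

if __name__ == "__main__":
    # Simulate strongly watermarked text: every token is green in context.
    wm = ["tok0"]
    for _ in range(50):
        wm.append(next(t for t in VOCAB if is_green(wm[-1], t)))
    print(f"watermarked z-score: {z_score(wm):.2f}")
    print(f"attacked z-score:    {z_score(color_aware_substitute(wm)):.2f}")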
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Preprint Status: We plan to release a non-anonymous preprint in the next two months (i.e., during the reviewing process).
A1: yes
A1 Elaboration For Yes Or No: Section 8
A2: yes
A2 Elaboration For Yes Or No: LLM watermarks that are deployed can be circumvented. We hope our work spurs the design of more robust techniques for watermarking.
A3: yes
A3 Elaboration For Yes Or No: Abstract and Section 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: Sections 1, 4, 5, and 6.
B2: no
B2 Elaboration For Yes Or No: The code and icons we use are publicly available and used only for research purposes.
B3: no
B3 Elaboration For Yes Or No: The code and icons we use are publicly available and used only for research purposes. Our code will be released later for research purposes.
B4: no
B4 Elaboration For Yes Or No: The C4 dataset's RealNewsLike subset we use is public and widely used.
B5: n/a
B6: yes
B6 Elaboration For Yes Or No: Appendix D.1
C: yes
C1: yes
C1 Elaboration For Yes Or No: Appendix D.2
C2: yes
C2 Elaboration For Yes Or No: Section 5 and Appendix D
C3: no
C3 Elaboration For Yes Or No: We report a single run, as indicated by context.
C4: yes
C4 Elaboration For Yes Or No: Section 5 and Appendix D
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: yes
E1: no
E1 Elaboration For Yes Or No: AI assistants were used only for auxiliary purposes, and AI-generated content was carefully checked.