Machine never said that: Defending spoofing attacks by diverse fragile watermark

Published: 06 Mar 2025, Last Modified: 16 Apr 2025, Venue: WMARK@ICLR2025, License: CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: LLM, LLM Watermark, Modification Detection, Fragile Watermark
Abstract: Misuse of large language models (LLMs) has intensified the need for robust detection of generated text through watermarking. Existing watermarking methods prioritize robustness but remain vulnerable to spoofing attacks, in which modified text retains a detectable watermark and malicious content is falsely attributed to the LLM. We propose the Multiple-Sampling Fragile Watermark (MSFW), the first framework to integrate local fragile watermarks to defend against such attacks. By embedding context-dependent watermarks through a multiple-sampling strategy, MSFW enables two critical detection capabilities: (1) modification detection via localized watermark fragility, where any modification disrupts the adjacent watermark and is revealed by localized watermark extraction; (2) generated-text detection using the unaffected global watermarks. In addition, our watermarking method is unbiased, and the multiple-sampling strategy improves the diversity of the output. This work bridges the gap between robustness and fragility in LLM watermarking, offering a practical defense against spoofing attacks without compromising utility.
Presenter: ~Yuhang_Cai4
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 20