Track: long paper (up to 9 pages)
Keywords: LLM, LLM Watermark, Modification Detection, Fragile Watermark
Abstract: The misuse of large language models (LLMs) has intensified the need for reliable generated-text detection through watermarking. Existing watermarking methods prioritize robustness but remain vulnerable to spoofing attacks, where modified text retains detectable watermarks, falsely attributing malicious content to the LLM. We propose the Multiple-Sampling Fragile Watermark (MSFW), the first framework to integrate local fragile watermarks to defend against such attacks. By embedding context-dependent watermarks through a multiple-sampling strategy, MSFW enables two critical detection capabilities: (1) modification detection via localized watermark fragility, where any modification disrupts adjacent watermarks and is revealed through localized watermark extraction; (2) generated-text detection using the unaffected global watermark. Moreover, our watermarking method is unbiased, and the multiple-sampling strategy improves the diversity of the output. This work bridges the gap between robustness and fragility in LLM watermarking, offering a practical defense against spoofing attacks without compromising utility.
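The intuition behind a context-dependent fragile watermark can be conveyed with a minimal sketch. The toy below is not the paper's MSFW construction (which is not specified in this abstract); the vocabulary, the key `SEED_KEY`, the window size `WINDOW`, the hash-bit selection rule, and the stand-in sampler are all hypothetical choices for illustration only.

```python
# Toy sketch (hypothetical names/parameters; NOT the paper's MSFW algorithm):
# a context-keyed fragile watermark embedded by choosing, among several
# sampled candidates, one whose keyed hash bit verifies.
import hashlib
import random

SEED_KEY = "demo-key"  # hypothetical shared watermark key
WINDOW = 3             # hypothetical local context window for hashing
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta",
         "eta", "theta", "iota", "kappa", "lambda", "mu"]

def context_bit(tokens: list[str]) -> int:
    """Pseudo-random bit derived from the key and the last WINDOW tokens."""
    payload = SEED_KEY + "|" + " ".join(tokens[-WINDOW:])
    return hashlib.sha256(payload.encode()).digest()[0] & 1

def generate(n_tokens: int, seed: int = 0) -> list[str]:
    """Multiple-sampling embedding: consider candidates in a sampled order
    and keep the first whose own context-derived bit is 1, so every
    position carries a locally verifiable watermark bit."""
    rng = random.Random(seed)
    out: list[str] = []
    for _ in range(n_tokens):
        cands = rng.sample(VOCAB, k=len(VOCAB))  # stand-in for LLM sampling
        chosen = next((c for c in cands if context_bit(out + [c]) == 1),
                      cands[0])  # fallback only if no candidate verifies
        out.append(chosen)
    return out

def detect(tokens: list[str]) -> list[int]:
    """Localized extraction: positions whose watermark bit fails to verify.
    An edit only disturbs positions whose hash window overlaps it."""
    return [i for i in range(len(tokens)) if context_bit(tokens[: i + 1]) != 1]

text = generate(12)
print(detect(text))      # typically [] -- unmodified text verifies everywhere
tampered = list(text)
tampered[5] = "omega"    # a single-token modification
print(detect(tampered))  # failures, if any, cluster in positions 5..7
```

Because each position's bit depends only on a short local window, an edit breaks verification only near where it occurs, while the watermark elsewhere stays intact; a real scheme would operate over model logits with an unbiased sampling rule rather than a toy vocabulary.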
Presenter: ~Yuhang_Cai4
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 20