De-mark: Watermark Removal in Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Watermarking techniques offer a promising way to identify machine-generated content by embedding covert information into text generated by language models (LMs). However, the robustness of these watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method uses a novel querying strategy, termed random selection probing, to assess the strength of the watermark and identify the red-green list underlying the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
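The red-green list watermark that the abstract targets can be illustrated with a minimal sketch. The hash-based seeding, function names, and parameter values below are illustrative assumptions, not the paper's actual construction:

```python
import hashlib
import random

def green_list(prev_tokens, vocab_size, gamma=0.5):
    """Derive the 'green' subset of the vocabulary from the preceding
    n-gram. A pseudorandom partition seeded by the context hash is a
    common choice in n-gram watermark schemes; this seeding is a
    hypothetical stand-in, not De-mark's or any specific scheme's code."""
    key = ",".join(map(str, prev_tokens)).encode()
    seed = int(hashlib.sha256(key).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_tokens, delta=2.0, gamma=0.5):
    """Bias generation toward green tokens by adding delta to their
    logits; red-list tokens are left untouched."""
    green = green_list(prev_tokens, len(logits), gamma)
    return [l + delta if i in green else l for i, l in enumerate(logits)]
```

Because the partition is a deterministic function of the preceding n-gram, a detector (or an attacker probing the model, as De-mark does) can recompute the green list for any context and measure how strongly green tokens are favored.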
Lay Summary: As AI-generated text becomes more realistic and widespread, it is increasingly difficult to tell whether content was written by a human or a machine. To address this, researchers have developed watermarking techniques that subtly mark AI-generated text, making it detectable later. But how reliable are these marks? We introduce De-mark, a new method that can effectively remove these hidden watermarks, even when we don't know how they were created. Our technique uses a novel probing strategy to reveal the hidden rules behind the watermark and reverse them. De-mark can also steal a watermarking scheme and apply it to another AI model. Our experiments show that De-mark works on some of the most powerful language models today. This raises important questions about the future of watermarking. While our research helps clarify the limits of current watermarking, it also highlights the urgent need for more robust and ethical watermarking solutions.
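The "reverse them" step in the lay summary can be sketched as follows: once probing has yielded an estimate of the green list and the bias strength, subtracting that bias approximately restores the unwatermarked distribution. The inputs here are assumed to be recovered already; the function name and signature are illustrative, not the paper's API:

```python
def remove_watermark(logits, estimated_green, estimated_delta):
    """Neutralize a red-green list watermark by subtracting the
    estimated logit bias from tokens believed to be green. Both
    estimated_green and estimated_delta would come from a probing
    stage (e.g. De-mark's random selection probing); they are taken
    as given in this sketch."""
    return [l - estimated_delta if i in estimated_green else l
            for i, l in enumerate(logits)]
```

If the estimates are accurate, sampling from the corrected logits no longer over-produces green tokens, so the detector's statistical test loses its signal.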
Primary Area: Deep Learning->Large Language Models
Keywords: Language Model Watermarking; Watermarking Removal
Submission Number: 8304