Abstract: Model extraction (ME) attacks threaten the intellectual property of machine learning models: an adversary extracts a target model using carefully crafted samples. Model watermarking, which embeds specific information into a model, has been proposed to protect model ownership. State-of-the-art methods usually entangle watermark samples with main-task samples, aiming to provide robust watermark verification under ME attacks. In this paper, however, we defeat entangled watermarks and demonstrate their vulnerability to detection and removal attacks using only a small set of clean samples. We further propose a novel framework, named MarkErase, to perform ME attacks against entangled watermarks. MarkErase is based on two key observations. First, we identify a unique classification tendency of watermarked models, which enables early detection of the watermark during an attack. Second, based on the observation that models with entangled watermarks tend to misclassify perturbed inputs as the target class, we propose a selective distillation method that effectively removes the watermark while maintaining main-task accuracy. Comprehensive experiments show that MarkErase achieves a watermark-task accuracy close to 0 with minimal loss in main-task performance. Our code is publicly available at https://github.com/MarkErase/MarkErase.git.
External IDs:dblp:conf/pakdd/FeiLZYZZ25