Abstract: Deep Neural Networks (DNNs) are increasingly deployed in cloud-based services through APIs, e.g., prediction APIs. Recent studies show that these public APIs are vulnerable to model extraction attacks, in which an adversary attempts to train a local copy of the private model using predictions returned by the API. Existing defenses mainly focus on perturbing the prediction distribution to undermine the attacker's training objective, and thus inevitably degrade the utility of the API. In this work, we extend the concept of watermarking to protect APIs. The main idea is to insert a watermark, known only to the defender, into the protected model so that it is then transferred to all stolen models. The defender can use knowledge of the watermark to detect and certify stolen models. However, the effectiveness of such a watermark remains limited: watermarks are distinct from the task data, whereas the adversary in an extraction attack only uses inputs sampled from the task distribution, so the watermark tends to be discarded during extraction. To bridge this gap, we propose a feature-sharing framework that improves the transferability of watermarks. We encourage the model to use the same features for legitimate data and watermarks in all layers and to differ only in the final decision layers. Comprehensive experiments on text and image domains indicate that the proposed framework watermarks APIs effectively while preserving their utility. In addition, experimental analysis validates the robustness of the watermark against various watermark removal attacks.
Supplementary Material: zip