Grey-box Extraction of Natural Language Models

Santiago Zanella-Beguelin; Shruti Tople; Andrew Paverd; Boris Köpf

Grey-box Extraction of Natural Language Models

Santiago Zanella-Beguelin, Shruti Tople, Andrew Paverd, Boris Köpf

28 Sept 2020 (modified: 05 May 2023)ICLR 2021 Conference Blind SubmissionReaders: Everyone

Keywords: language models, transformer, model extraction, security

Abstract: Model extraction attacks attempt to replicate a target machine learning model from predictions obtained by querying its inference API. Most existing attacks on Deep Neural Networks achieve this by supervised training of the copy using the victim's predictions. An emerging class of attacks exploit algebraic properties of DNNs to obtain high-fidelity copies using orders of magnitude fewer queries than the prior state-of-the-art. So far, such powerful attacks have been limited to networks with few hidden layers and ReLU activations. In this paper we present algebraic attacks on large-scale natural language models in a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key observation is that a small set of arbitrary embedding vectors is likely to form a basis of the classification layer's input space, which a grey-box adversary can compute. We show how to use this information to solve an equation system that determines the classification layer from the corresponding probability outputs. We evaluate the effectiveness of our attacks on different sizes of transformer models and downstream tasks. Our key findings are that (i) with frozen base layers, high-fidelity extraction is possible with a number of queries that is as small as twice the input dimension of the last layer. This is true even for queries that are entirely in-distribution, making extraction attacks indistinguishable from legitimate use; (ii) with fine-tuned base layers, the effectiveness of algebraic attacks decreases with the learning rate, showing that fine-tuning is not only beneficial for accuracy but also indispensable for model confidentiality.

One-sentence Summary: We demonstrate effective grey-box extraction of classification layers of natural language models with public encoders

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Reviewed Version (pdf): https://openreview.net/references/pdf?id=rfaVRyxmQI

12 Replies

Loading