## SEEKER: Query-Efficient Model Extraction via Semi-Supervised Public Knowledge Transfer
This repository contains the implementation for our paper *SEEKER: Query-Efficient Model Extraction via Semi-Supervised Public Knowledge Transfer*.
> Model extraction attacks against neural networks aim at extracting models without white-box access to model internals and training datasets. Unfortunately, most existing methods demand an excessive number of queries (up to millions) to reproduce a functional substitute model, greatly limiting their real-world applicability. In this work, we propose a query-efficient model extraction attack that effectively distills knowledge from publicly available data. To this end, we introduce a self-supervised training mechanism to pre-train the substitute model without interacting with the victim model. The proposed mechanism optimizes the substitute model to learn a generalizable image encoding pattern based on semantic consistency of neural networks. We further propose a query generator that enhances the information density of generated queries by aggregating public information, thereby greatly mitigating the query cost required for constructing the substitute model. Extensive experiments demonstrate that our method achieves state-of-the-art performance which improves query-efficiency by as much as 50× with higher accuracy. Additionally, our attack demonstrates the capability of bypassing most types of existing defense mechanisms.
### Requirements
* PyTorch >= 1.7.1
* NumPy >= 1.21.2
* torchvision >= 0.8.2
* advertorch >= 0.2.3
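
The following is a minimal sketch for checking an environment against the minimum versions listed above. It is not part of the repository. It assumes Python 3.8+ (for `importlib.metadata`) and assumes the distribution names `torch`, `numpy`, `torchvision`, and `advertorch`; the helper names `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical.

```python
import re
from importlib import metadata

# Minimum versions copied from the Requirements list. The distribution
# names (e.g. "torch" for PyTorch) are assumptions, not taken from the repo.
MINIMUMS = {
    "torch": "1.7.1",
    "numpy": "1.21.2",
    "torchvision": "0.8.2",
    "advertorch": "0.2.3",
}


def parse_version(text):
    """Turn '1.7.1+cu110' into (1, 7, 1), ignoring local and pre-release suffixes."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUMS):
    """Return {name: status}, where status is 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

The comparison only looks at the numeric release components, so it is a rough check rather than a full implementation of Python's version ordering rules.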

### Directory layout
```
.
├── checkpoints     # This directory includes model checkpoints.
├── code
│   ├── classifier  # This directory includes model architecture for victim and substitute.
│   ├── query  # This directory includes query generation and victim querying.
│   ├── config.py   # This is the configuration file.
│   ├── data    # This directory includes files for reading and pre-processing datasets.
│   ├── generator   # This directory includes the architecture file for the aggregated query generator.
│   ├── loss    # This directory includes files that define the loss functions.
│   ├── main.py
│   ├── setup.py    # This file includes the main procedures for our framework.
│   ├── simclr  # This directory includes files for running SimCLR.
│   ├── trainer
│   │   ├── classifier_trainer.py # This file trains the victim models.
│   │   ├── __init__.py
│   │   ├── simclr_trainer.py   # This file trains the substitute with SimCLR.
│   │   └── substitute_trainer.py
│   └── utils
│       ├── __init__.py
│       └── plot.py # This file is for plotting graphs.
├── datasets    # This directory includes datasets, e.g. CIFAR-10.
├── LICENSE
├── README.md
└── results     # This directory includes results, e.g. figures for accuracy change during substitute training.
```

### Datasets
Our experiments use the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet datasets.
CIFAR-10 and CIFAR-100 are downloaded automatically by our scripts.
As required by the dataset authors, Tiny ImageNet and ImageNet must be downloaded manually.
All datasets should be placed under the `./datasets` directory.

### Pre-trained models
Due to anonymity requirements, we cannot provide a download link for the checkpoints.
We will release the pre-trained models after the double-blind review process.

### How to attack
`config.py` contains the recommended configuration for reproducing results with CIFAR-10 as the victim dataset (D_S) and CIFAR-100 as the public dataset (D_P).
To reproduce results with CIFAR-100 as D_S and CIFAR-10 as D_P, modify `config.py` according to the following table.
You can also change the value of `public_dataset` to select another public dataset, e.g. Tiny ImageNet or ImageNet.

| Variable | Value |
| ---- | ---- |
| `victim_dataset` | cifar100 |
| `public_dataset` | cifar10 |
| `noise_eps` | 0.2 |
| `noise_step` | 30 |
| `eps_multiple` | 2 |
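
As a rough illustration, the table above can be written as a set of overrides in Python. This is only a sketch: how `config.py` actually stores these values (module-level variables, a dictionary, or a class) is an assumption, and the names `CIFAR100_VICTIM_OVERRIDES` and `apply_overrides` are hypothetical.

```python
# Values copied from the table above. Storing them in a dict is an
# assumption; config.py may define them as plain module-level variables.
CIFAR100_VICTIM_OVERRIDES = {
    "victim_dataset": "cifar100",
    "public_dataset": "cifar10",
    "noise_eps": 0.2,
    "noise_step": 30,
    "eps_multiple": 2,
}


def apply_overrides(config, overrides):
    """Return a copy of `config` with `overrides` applied.

    Unknown keys raise KeyError so that a misspelled variable name
    is caught instead of being silently ignored.
    """
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"unknown config keys: {unknown}")
    updated = dict(config)
    updated.update(overrides)
    return updated


if __name__ == "__main__":
    # Placeholder base config: only the two dataset names are known from
    # this README; the remaining defaults are left unspecified here.
    base = {
        "victim_dataset": "cifar10",
        "public_dataset": "cifar100",
        "noise_eps": None,
        "noise_step": None,
        "eps_multiple": None,
    }
    print(apply_overrides(base, CIFAR100_VICTIM_OVERRIDES))
```

If `config.py` uses plain variables, the same five values can be edited there directly.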

After specifying your own home directory, run `python main.py` from the `code` directory to reproduce our results.