# Installation and outline
Before running our code, you need to install the main package
`./io_model/` as well as modified versions of the packages used for arch2vec and
NAS-Bench-101 manipulation. The folder structure and installation instructions
are summarized in the following sections.

The numbered sections contain instructions for reproducing
our results (steps 1.-7.):

1. Train networks from NAS-Bench-101
2. Create the IO dataset
3. Train the IO model or the Accuracy model
4. Compute baselines of datasets
5. Analyze the losses and the dataset in notebooks
6. Extract features
7. Run reinforce or performance prediction

You can skip steps 1. and 2. if you use the IO datasets included in
this folder (described in step 3.).

## Get data
To run our code, you need to download the NAS-Bench-101 dataset. This
[link](https://storage.googleapis.com/nasbench/nasbench_only108.tfrecord)
points to the smaller subset of the dataset; the full data can be found
in the original [repository](https://github.com/google-research/nasbench).
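
The download can also be scripted. The following is a minimal sketch, not part of the repository: `download_nasbench` is a hypothetical helper, and the destination directory is an assumption, since the right location depends on where your setup expects the file.

```bash
# Hypothetical helper (not part of the repository): fetch the NAS-Bench-101
# subset linked above into a directory of your choice.
NASBENCH_URL='https://storage.googleapis.com/nasbench/nasbench_only108.tfrecord'

download_nasbench() {
  # $1: destination directory (an assumption; use the path your setup expects)
  if [ -z "${1:-}" ]; then
    echo "usage: download_nasbench DEST_DIR" >&2
    return 1
  fi
  mkdir -p "$1" || return 1
  # -L follows redirects, --fail turns HTTP errors into a non-zero exit code.
  curl -L --fail -o "$1/nasbench_only108.tfrecord" "$NASBENCH_URL"
}

# Example (the directory name is made up):
#   download_nasbench ./nasbench_data
```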

The created datasets, the trained checkpoints with extracted features for
the studied models, and the experiment results can be downloaded [here](https://ufile.io/f/ff0x6).
Extract the zip files directly into the folder `./io_model/data/` so that no paths need to be
corrected. Do not add any subdirectories: the correct path is, for
example, `./io_model/data/paper_final_results/`, not `./io_model/data/data/paper_final_results/`.
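
A quick way to check the layout after extraction is sketched below. The helper `check_data_layout` is hypothetical (it is not shipped with the code) and only tests for the two directory layouts mentioned above.

```bash
# Hypothetical helper (not part of the repository): report whether the
# archives were extracted at the expected level.
check_data_layout() {
  # $1: data directory, defaults to the path used throughout this README
  local data_dir="${1:-./io_model/data}"
  if [ -d "$data_dir/paper_final_results" ]; then
    echo "ok"        # expected layout
  elif [ -d "$data_dir/data/paper_final_results" ]; then
    echo "nested"    # extracted one level too deep; move the contents up
  else
    echo "missing"   # results folder not found
  fi
}

# Example, run from the root of the code folder:
#   check_data_layout ./io_model/data
```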

The checkpoints of trained networks can be downloaded from [here](https://ufile.io/f/ppsl8).
We also provide hashes of the trained networks (in the previous download link).

## Folder structure
- packages
  - `./io_model/`, `./nasbench/`, `./NASBench-PyTorch/`, `./arch2vec/`
- scripts for step 7.
  - `run_reinforce.sh`, `run_performance_pred.sh`
- pip and conda requirements files (for debugging, see installation instructions)
- the **data folder** is in `./io_model/data`, contents:
  - `paper_final_results/`
    - all models trained for 3 seeds, their checkpoints (step 3.),
      extracted features (step 6.) and experiment results (step 7.)
    - baselines (step 4.)
    - IO datasets
      - `train_long.pt`, `valid_long.pt`, `test_small_split.pt` -
        train, unseen network validation, unseen images validation datasets
      - `test_train_long.pt`, `test_valid_long.pt` - test sets
    - pickled NAS-Bench-101 (`nasbench.pickle`) for faster training
    - `nb_dataset.json` - NAS-Bench-101 networks, architectures only
      - created by arch2vec code
    - `train_hashes.csv`, `valid_hashes.csv` - train and validation hashes of labeled networks

## Installing IO model library

To run the code, you need to install the four packages included
in this folder:
- `./io_model/`, the main package - our contribution
- `./nasbench/`, the original nasbench modified for `TensorFlow >= 2.0` 
- `./NASBench-PyTorch/`, library that translates NAS-Bench-101 architectures
  into PyTorch models
- `./arch2vec/`, the arch2vec library modified so that the model could be extended into the IO model


All models are written in `PyTorch`, but `TensorFlow >= 2.0` is needed
for the NAS-Bench-101 dataset. A script that creates a new virtual
environment with all the necessary packages is provided:
```bash
cd './io_model/'
bash setup_venv.sh
```

We also provide the outputs of `conda list -e` and `pip freeze` of the
environment we used: `conda_requiremens.txt` and `pip_requirements.txt`.
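
When debugging an environment, it can help to see which pinned packages are absent from it. The sketch below assumes you have saved the output of `pip freeze` to a file; `missing_requirements` is a hypothetical helper, and it compares lines verbatim, so a package installed in a different version is also reported.

```bash
# Hypothetical helper (not part of the repository): print the lines of a
# requirements file that do not appear verbatim in a `pip freeze` dump.
missing_requirements() {
  # $1: requirements file, $2: file containing the output of `pip freeze`
  # -F fixed strings, -x whole-line match, -v invert: keep unmatched lines.
  # grep exits with 1 when nothing is printed, hence the `|| true`.
  grep -Fxv -f "$2" "$1" || true
}

# Example (run inside the activated environment, in the folder that
# contains pip_requirements.txt):
#   pip freeze > /tmp/current_freeze.txt
#   missing_requirements pip_requirements.txt /tmp/current_freeze.txt
```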

# 1. Train networks from NAS-Bench-101
To pretrain the train and validation networks, run the following script:
```bash
cd './io_model/scripts/'
python pretrain_i_th_hashes.py --hash_csv '../data/train_hashes.csv'
python pretrain_i_th_hashes.py --hash_csv '../data/valid_hashes.csv'
```
The checkpoints will be saved to `../data/out_train_hashes/` and `../data/out_valid_hashes/` respectively. 
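
To check whether a pretraining run has produced its output directory, a small sketch follows. The naming rule in `expected_out_dir` (a hypothetical helper) is only inferred from the two cases above and may not match what the script does for other file names.

```bash
# Hypothetical helper (not part of the repository): derive the checkpoint
# directory from a hash CSV path, following the two documented cases
# (train_hashes.csv -> out_train_hashes/). The rule is an inference.
expected_out_dir() {
  # $1: path to the hash CSV passed via --hash_csv
  printf '%s/out_%s/\n' "$(dirname "$1")" "$(basename "$1" .csv)"
}

# Example: warn if no checkpoint directory exists yet.
#   [ -d "$(expected_out_dir ../data/train_hashes.csv)" ] || echo "no checkpoints yet"
```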

# 2. Create the IO dataset
To create the IO dataset, run `create_io_dataset.py` in the same directory:
```bash
cd ./io_model/scripts/
python create_io_dataset.py '../data/out_train_hashes/' --save_path '../data/train_io_dataset.pt'
python create_io_dataset.py '../data/out_valid_hashes/' --save_path '../data/valid_io_dataset.pt'
```
The first argument can also be a list of directories that contain checkpoints of trained networks.
If you want to use CIFAR test data instead of CIFAR validation data, pass the option `--use_test_data`
to the script (this was used to create the test datasets in the paper, see Appendix).

# 3. Train the IO model or the Accuracy model
To train the IO model (including the reference arch2vec trained on the same batches),
run the following script:

```bash
cd './io_model/scripts/'
# export CUBLAS_WORKSPACE_CONFIG=:4096:8  # you may need to set this variable
python train_vae.py --model_cfg '../configs/model_config.json' --epochs 10 \
  --device cuda --use_ref --seed 1 --deterministic --use_unseen_data
```

To train the Accuracy model, run the same command with the additional argument `--use_accuracy`. By default,
the script trains on the already created train and validation datasets.

Other useful arguments:
- `--train_path`, `--valid_path`
  - paths to the train and validation datasets to train on
  - e.g. `../data/train_io_dataset.pt` from step 2.
- `--unseen_valid_path`
  - path to unseen images validation dataset
- `--no_unseen_data`
  - turn off the unseen images validation dataset
- `--checkpoint_path`
  - directory where the checkpoints are saved

To train the original arch2vec, use the script from the arch2vec package:
```bash
python ./arch2vec/arch2vec/models/pretraining_nasbench101.py \
  --data './io_model/data/nb_dataset.json' \
  --dim 16 --cfg 4 --bs 32 --epochs 8 --seed 1 --name arch2vec_orig
```
The result will be saved to
`./arch2vec/arch2vec/models/pretrained/dim-16/model-arch2vec_orig.pt`.
Steps 6. and 7. are the same for this checkpoint as for the IO and Accuracy
checkpoints.

# 4. Compute baselines of datasets
To compute the labeled baselines (see Experiments in the paper), run the following script:
```bash
cd './io_model/scripts/'
python compute_dataset_stats.py train --dataset '../data/train_long.pt'
python compute_dataset_stats.py valid --dataset '../data/valid_long.pt'
```

The baseline of the unseen images validation set is computed like this:
```bash
cd './io_model/scripts/'
python compute_dataset_stats.py train --dataset '../data/test_train_long.pt' \
  --model_cfg '../configs/model_config.json' --split_ratio 0.1
```

# 5. Analyze the losses and the dataset in notebooks
The losses and metrics are processed in the notebook
`./io_model/notebooks/process_training_stats.ipynb`. The data is fetched
from `./io_model/data/paper_final_results/`, the folder with all training
results, checkpoints and the baselines.

To analyze the network outputs (from the IO dataset), look at the notebook
`./io_model/notebooks/heatmaps_whole_dataset.ipynb`. The notebook was
used to:
1) create histogram of train network accuracies
2) query NAS-Bench-101 for network training time
3) perform the clustering of networks by output features

# 6. Extract features
In steps 6. and 7., you can use the trained checkpoints provided in the
folder `paper_final_results/`.

To extract features of NAS-Bench-101 using a model (i.e., predict on all architectures
from the search space), call the following script:

- for the IO model (like arch2vec, model 1)
```bash
cd './io_model/scripts/'
python extract_infonas_embedding.py --dir_path $CHECKPOINT_DIR \
  --model_path model_orig_epoch-9.pt \
  --is_arch2vec
```
- for the IO model (flatten, model 2)
```bash
cd './io_model/scripts/'
python extract_infonas_embedding.py --dir_path $CHECKPOINT_DIR \
  --model_path model_orig_epoch-9.pt
```
- for the reference model 
```bash
cd './io_model/scripts/'
python extract_infonas_embedding.py --dir_path $CHECKPOINT_DIR \
  --model_path model_ref_epoch-9.pt
```
- for the accuracy model
  - same commands as for the IO model (both 1 and 2), but with the additional argument `--is_accuracy`

All embeddings are saved as `f"{dir_path}/embedding_{model_path}"`.
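
The resulting file name can be reconstructed in the shell, for instance to set `$EMBEDDING_PATH` for step 7. This is a minimal sketch: `embedding_path` is a hypothetical helper that only mirrors the pattern quoted above.

```bash
# Hypothetical helper (not part of the repository): mirrors the pattern
# f"{dir_path}/embedding_{model_path}" quoted above.
embedding_path() {
  # $1: checkpoint directory (--dir_path), $2: checkpoint file (--model_path)
  printf '%s/embedding_%s\n' "$1" "$2"
}

# Example:
#   EMBEDDING_PATH=$(embedding_path "$CHECKPOINT_DIR" model_orig_epoch-9.pt)
```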

# 7. Run reinforce or performance prediction
To run reinforce or performance prediction, two bash scripts are provided
in the root of the code folder. This time, do not call `cd './io_model/scripts/'`
as in the previous examples.

- reinforce 100 times (`$EMBEDDING_PATH` is the path to features extracted in the previous step):
```bash
./run_reinforce.sh $EMBEDDING_PATH
```
- reinforce 100 - `$i` times (resuming from seed `$i`):
```bash
./run_reinforce.sh $EMBEDDING_PATH $i
```
- performance prediction (multiple sample sizes, 10 seeds)
```bash
./run_performance_pred.sh $EMBEDDING_PATH
```

The results are saved in the directory that contains `$EMBEDDING_PATH`
(i.e. the output of `dirname $EMBEDDING_PATH`), in the folders
`reinforce-runs/` and `regr/` respectively.
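
To locate the outputs for a given embedding file, the two directories can be computed as sketched below. `result_dirs` is a hypothetical helper based only on the description above.

```bash
# Hypothetical helper (not part of the repository): print the two result
# directories that belong to an embedding file.
result_dirs() {
  # $1: path to the extracted features ($EMBEDDING_PATH)
  local base
  base=$(dirname "$1")
  printf '%s/reinforce-runs/\n%s/regr/\n' "$base" "$base"
}

# Example:
#   result_dirs "$EMBEDDING_PATH"
```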

Finally, the notebooks `./io_model/notebooks/process_reinforce_results.ipynb`
and `./io_model/notebooks/process_regressor_results.ipynb` visualize
the results of both runs (relying on the directory structure in
`./io_model/data/paper_final_results/`).