# Launching NIC-RobustBench on a SLURM cluster

This folder contains the scripts and other files needed to run NIC-RobustBench on a cluster managed by the [SLURM](https://github.com/SchedMD/slurm) workload manager. The "How to run" section gives instructions for launching a large-scale evaluation on a cluster. Please refer to the [official SLURM documentation](https://slurm.schedmd.com/) for more information on SLURM-specific commands and their arguments.


## File overview

| File                               | Description                                                                                           |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `env_vars.sh`   | Bash script defining the environment variables used by the other SLURM scripts. Should be sourced before running `massive_launch.sh` or the other scripts.                  |
| `run_config.yaml`       | Configuration file defining the main run parameters, e.g. attack objective, attack preset, saving paths, etc.  |
| `slurm_script.sh`              | Main script that launches the entire evaluation pipeline for a specific `(codec, defense, attack, attack preset)` combination and saves the results. It is launched with the `srun` command and runs entirely inside the job container.                       |
| `codecs_run_universal_1_attack.sh`       | Script for a batched launch of NIC-RobustBench on a set of different codecs and a *single* attack/attack preset. Should be launched with the `sbatch` command. |
| `massive_launch.sh`             | Bash script that launches the evaluation of multiple attacks (with a specified attack preset) on multiple codecs. *Should be launched on the SLURM master node*, as it internally dispatches multiple `sbatch` commands with the `codecs_run_universal_1_attack.sh` script, one attack per `sbatch`.  |
| `load_weights.py`                       | Auxiliary script that copies codec/attack weights and other files from the mounted repository path to the job container (on the compute node). Invoked automatically by `slurm_script.sh`.                      |
| `codecs.txt`          | Newline-separated list of all codec models currently supported by NIC-RobustBench. Can be used by the SLURM scripts for batched evaluation of all codecs. The `codecs_debug.txt` and `codecs_nonjpegai.txt` files contain smaller subsets of this list and can also be used for batched job launches.           |

---


Script hierarchy:
```
massive_launch.sh (all attacks, all codecs)
├── codecs_run_universal_1_attack.sh (1 attack, all codecs)
│   ├── [inside job container] slurm_script.sh (1 attack, 1 codec)
│   │                           └── evaluation-pipeline/main_eval_script.py
│   └──...
└── ... 
```

## How to run

1. **Build Docker images**

In our setup, we use SLURM combined with the [Nvidia PyXis](https://github.com/NVIDIA/pyxis) rootless container technology. First, Docker images should be built from the corresponding `Dockerfiles`. The Dockerfiles are provided in `build-scripts` folder. We use two separate images: one specifically for the JPEG AI codec model (`jpegai.Dockerfile`), and another for the rest of the NIC models (`main.Dockerfile`). The reason is that JPEG AI requires a specific PyTorch version, which is incompatible with many other codecs. The images can be built locally with the following commands:

```bash
cd slurm-image-builds
docker build -f main.Dockerfile -t codecs-slurm/main:latest .
docker build -f jpegai.Dockerfile -t codecs-slurm/jpegai:latest .
```

2. **Export squashfs file for PyXis**

Our PyXis setup requires the images to be exported into `squashfs` files that SLURM will use as unprivileged containers. This can be done with the [Nvidia Enroot](https://github.com/NVIDIA/enroot) tool:

```bash
# This will create a file named codecs-slurm+main+latest.sqsh (23GB)
enroot import dockerd://codecs-slurm/main:latest
# This will create a file named codecs-slurm+jpegai+latest.sqsh (28GB)
enroot import dockerd://codecs-slurm/jpegai:latest 
```

These files should be transferred to the SLURM master node, e.g. with `scp`:
```bash 
scp ./codecs-slurm+main+latest.sqsh *your cluster*:*your path*
scp ./codecs-slurm+jpegai+latest.sqsh *your cluster*:*your path*
```

3. **Set up the repository and download weights for NIC models**

The following steps should be executed on the SLURM master (front-end) node.

First, clone this repo:

```bash
git clone https://github.com/msu-video-group/NIC-RobustBench
cd NIC-RobustBench
```

Next, download the weights for supported codecs (~10GB):

```bash
wget --backups=1 -nv https://titan.gml-team.ru:5003/fsdownload/o3kmmUJdU/models.zip \
         https://titan.gml-team.ru:5003/fsdownload/o3kmmUJdU/models.zip && rm models.zip.1
```

4. **Set up environment variables**

Before launching jobs via SLURM, several variables must be set to define the main paths and other parameters for your particular cluster setup. The file `env_vars.sh` defines all variables required by NIC-RobustBench. Please edit this file to match your setup. The key variables are:

* `REPO_PATH` : absolute path to the repository on the master node.
* `container_image_path`, `container_image_path`: absolute paths to the squashfs files from step 2.
* `outer_artifacts_path` : absolute path to a folder where artifacts will be stored (i.e., CSV files with numerical results and dump `.zip` files with examples of images from different steps of the codec evaluation pipeline).
* `codec_list_path` : path to a `.txt` file containing a newline-separated list of codecs to evaluate. Use `scripts-slurm/codecs.txt` for all available codecs.
* `MODEL_WEIGHTS_PATH` : path to the folder with codec weights downloaded in step 3.
* `slurm_*` parameters: SLURM-specific parameters. See the comments in `env_vars.sh` and the SLURM docs for details. Set them according to your cluster specifications.


When all variables in `env_vars.sh` have been set, run
```bash
source scripts-slurm/env_vars.sh
```

to assign them.
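After sourcing the file, a quick check can catch variables that were left empty before any job is submitted. The sketch below is an assumption-laden helper, not part of the repository: it only covers the key variables named in this README (not the `slurm_*` parameters), and the function name `check_env_vars` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: report which of the key variables from env_vars.sh are unset or empty.
# Only the variable names listed in this README are checked.

check_env_vars() {
    local missing=0 name
    for name in REPO_PATH container_image_path outer_artifacts_path \
                codec_list_path MODEL_WEIGHTS_PATH; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name. ":-" keeps this safe under `set -u`.
        if [ -z "${!name:-}" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Return 0 if everything is set, 1 otherwise.
    return "$missing"
}
```

Calling `check_env_vars` in the same shell where `env_vars.sh` was sourced prints one `missing: <name>` line per unset variable and returns a non-zero status if any are missing.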

5. **Run the evaluation**

The `massive_launch.sh` script dispatches the jobs that evaluate the codecs listed in `codec_list_path` under different attacks. The list of attacks is specified in `massive_launch.sh` itself: to change it, edit the `attacks` variable. `massive_launch.sh` dispatches the jobs via the `sbatch` command, assigning one compute node per attack; all codecs from `codec_list_path` are evaluated under that attack as separate jobs on that node.
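The dispatch pattern described above (one `sbatch` submission per attack) can be illustrated with a dry run that only prints the commands. This is a sketch under stated assumptions: the attack names and the order of arguments passed to `codecs_run_universal_1_attack.sh` are placeholders, and the real `massive_launch.sh` may invoke it differently.

```bash
#!/usr/bin/env bash
# Sketch of the "one sbatch per attack" pattern.
# Assumptions: the argument order given to codecs_run_universal_1_attack.sh
# and the attack names are placeholders, not the repository's real interface.

dispatch_dry_run() {
    local preset="$1" objective="$2" config="$3"
    shift 3                      # the remaining arguments are attack names
    local attack
    for attack in "$@"; do
        # Print the command instead of running it, so the sketch can be
        # tried on a machine without SLURM.
        echo "sbatch scripts-slurm/codecs_run_universal_1_attack.sh $attack $preset $objective $config"
    done
}
```

For instance, `dispatch_dry_run 0 bpp_increase_loss cfg.yaml attack_a attack_b` prints two `sbatch` lines, one per (hypothetical) attack name.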

`massive_launch.sh` should be launched with 3 positional arguments:
* `$1` : The first argument is the attack preset number. It is an index into the list of attack parameter sets defined in `evaluation-pipeline/attack_presets_codecs.json`.
* `$2` : The second argument specifies the objective (i.e., optimization loss) for the adversarial attacks. The full list of supported objectives is in `evaluation-pipeline/codec_losses.py` (`loss_name_2_func` mapping).
* `$3` : The last argument is the path to a YAML config file for the run. Use `scripts-slurm/run_config.yaml` for the default parameters.
> **Note** that `run_config.yaml` contains the `attacked_dataset_path` and `reconstructed_dataset_path` fields, which specify where to save attacked and codec-reconstructed images. These fields define paths *inside* the job container and should point to an outside directory mounted into the container; otherwise the images are lost when the job ends. If these fields are left empty, attacked/reconstructed images won't be saved.

To start the evaluation, launch the `massive_launch.sh` script with the desired parameters. For example:
```bash
./scripts-slurm/massive_launch.sh 0 bpp_increase_loss ./scripts-slurm/run_config.yaml
```
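Because a mistyped argument only surfaces after jobs have been queued, the three positional arguments can be sanity-checked first. The helper below is a hypothetical sketch, not part of the repository; it checks only the argument format and does not verify that the preset index or the objective name actually exist in `attack_presets_codecs.json` or `codec_losses.py`.

```bash
#!/usr/bin/env bash
# Sketch: validate the three positional arguments of massive_launch.sh
# before submitting anything. Hypothetical helper, format checks only.

validate_launch_args() {
    local preset="${1:-}" objective="${2:-}" config="${3:-}"
    # The preset must be a non-negative integer (digits only, non-empty).
    case "$preset" in
        ''|*[!0-9]*)
            echo "preset must be a non-negative integer" >&2
            return 1 ;;
    esac
    # The objective name must not be empty.
    [ -n "$objective" ] || { echo "objective must not be empty" >&2; return 1; }
    # The config path must point to an existing regular file.
    [ -f "$config" ] || { echo "config file not found: $config" >&2; return 1; }
}
```

A possible use is to chain it in front of the launch command, e.g. `validate_launch_args 0 bpp_increase_loss ./scripts-slurm/run_config.yaml && ./scripts-slurm/massive_launch.sh 0 bpp_increase_loss ./scripts-slurm/run_config.yaml`.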



