# ImageNet Training

The goal of this folder is to train a ResNet-18 on the [ImageNet dataset](XXXX). We train it in many different configurations for:
- nesim loss
- cross layer wiring loss

This is what each file does:
- `generate_cross_layer_wiring_loss_configs.py`: generates the cross layer wiring loss configs based on the possible values mentioned in the script. Saves json files in the folder `cross_layer_correlation_configs/`
- `generate_nesim_configs.py`: generates all possible nesim configs, optionally including `scale = None`, which means a baseline run. Saves json files in the folder `nesim_configs/`
- `run_all_possible_trainings.py`: generates the commands to run `train.py` with different CLI args based on the configs found in `cross_layer_correlation_configs` and `nesim_configs`. Here are some useful notes:
  - If you add the `--slurm` arg, it saves all the generated shell files in a folder named `slurm/`.
  - The default slurm config asks for 32 CPU cores, 20GB RAM and 1 A100 GPU, with a runtime limit of 48 hours.
- `train.py`: the main entrypoint script, which starts and runs the training. Each training run is given a unique `run_name` based on the config args
- `possible_nesim_layers.json`: contains the names of layers upon which we have to apply the nesim losses.
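
As a rough illustration of what the config generators do, here is a minimal sketch that writes one json file per combination of values. The function name `generate_configs`, the keys `scale` and `layer_names`, the filename pattern and the example layer names are all assumptions made for this sketch; the real scripts may structure their configs differently.

```python
import itertools
import json
import os
import tempfile


def generate_configs(output_dir, scales, layer_groups):
    """Write one json config per (scale, layer group) pair and return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for scale, (group_name, layers) in itertools.product(scales, layer_groups.items()):
        # scale=None marks the baseline run
        tag = "baseline" if scale is None else f"scale_{scale}"
        # Hypothetical keys: the real config schema is not shown in this README
        config = {"scale": scale, "layer_names": layers}
        path = os.path.join(output_dir, f"{tag}_{group_name}.json")
        with open(path, "w") as f:
            json.dump(config, f, indent=4)
        paths.append(path)
    return paths


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    paths = generate_configs(
        tmp,
        scales=[None, 0.1],
        layer_groups={"all_conv_layers": ["layer1.0.conv1", "layer1.0.conv2"]},
    )
    names = [os.path.basename(p) for p in paths]
    loaded = []
    for p in paths:
        with open(p) as f:
            loaded.append(json.load(f))
print(names)
```

Note that `None` is stored as `null` in the json file and reads back as `None`, so a baseline config stays distinguishable from a config with `scale` set to `0`.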

## Instructions:

```
python3 generate_nesim_configs.py
python3 generate_cross_layer_wiring_loss_configs.py
python3 run_all_possible_trainings.py --slurm
python3 start_all_slurm_jobs.py
```

To train without slurm, drop the `--slurm` arg and just run `python3 run_all_possible_trainings.py`.
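
For orientation, the default slurm request described earlier (32 CPU cores, 20GB RAM, 1 A100 GPU, 48 hours) could be rendered into a batch script roughly as follows. This is only a sketch: `render_slurm_script` is a hypothetical helper, the GPU type string `a100` is cluster-specific, and the files that `run_all_possible_trainings.py --slurm` actually writes may look different.

```python
def render_slurm_script(job_name, command, cpus=32, mem_gb=20, hours=48, gpu="a100"):
    """Return the text of an sbatch script with the README's default resources."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        # Assumption: the GPU type name depends on how the cluster is configured
        f"#SBATCH --gres=gpu:{gpu}:1",
        f"#SBATCH --time={hours}:00:00",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Placeholder command: the real CLI args come from the generated configs
script = render_slurm_script("demo_run", "python3 train.py")
print(script)
```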

## Troubleshooting
If you hit an error that mentions `CXXABI_1.3.9`, run the following:

```
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mindhive/nklab3/users/XXXX-1/conda_stuff/lib/
```

## Resuming runs checklist:
- `--resume-wandb-run-id` - wandb run ID
- `--nesim-config` - the config json filename is a substring of the run name on wandb: `*all_conv_layers`
- `--nesim-apply-after-n-steps` - check from the run name
- `--num-epochs` - also in run name
- `--resume-from-checkpoint` - is something like: `checkpoints/imagenet/` + `run_name` + `all/train_step_idx_{LATEST_TRAIN_IDX}` where `LATEST_TRAIN_IDX` can be found by doing an `ls` on the folder
- check `learning_rate` in `resume_training.py` - make sure it matches the current learning rate in the crashed run
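
Finding `LATEST_TRAIN_IDX` by eye from `ls` output is error-prone because names sort lexically (`train_step_idx_9000` lists after `train_step_idx_10000`). Here is a minimal sketch that picks the highest index numerically, assuming the entries in the checkpoint folder are named `train_step_idx_<integer>`; `latest_train_step_checkpoint` is a hypothetical helper, not part of this repo.

```python
import os
import re
import tempfile


def latest_train_step_checkpoint(checkpoint_dir):
    """Return the path of the train_step_idx_* entry with the largest index."""
    pattern = re.compile(r"^train_step_idx_(\d+)")
    best = None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match is None:
            continue  # ignore unrelated files
        idx = int(match.group(1))  # compare as integers, not as strings
        if best is None or idx > best[0]:
            best = (idx, name)
    if best is None:
        raise FileNotFoundError(f"no train_step_idx_* entries in {checkpoint_dir}")
    return os.path.join(checkpoint_dir, best[1])


# Demo with a throwaway folder standing in for the real checkpoint directory
with tempfile.TemporaryDirectory() as tmp:
    for entry in ["train_step_idx_9000", "train_step_idx_10000", "notes.txt"]:
        os.mkdir(os.path.join(tmp, entry))
    latest_name = os.path.basename(latest_train_step_checkpoint(tmp))
print(latest_name)
```

The returned path is what you would pass to `--resume-from-checkpoint`.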
