# Supplementary Materials for "Can We Use Gradient Norm as a Measure of Generalization Error for Model Selection in Practice?"

Recent theoretical work on the upper bound of the generalization error of deep neural networks (DNNs) suggests that the gradient norm could serve as a measure that complements validation accuracy for model selection in practice.
In this work, we carry out empirical studies using several commonly used neural network architectures and benchmark datasets to understand the effectiveness and efficiency of using the gradient norm as a model selection criterion, especially in the setting of hyper-parameter optimization.
While we observe strong correlations between the generalization error and the gradient norm measures, we find that computing the gradient norm is time-consuming due to the high gradient complexity. To balance the trade-off between efficiency and effectiveness, we propose an accelerated approximation of the gradient norm that only computes the loss gradient in the Fully-Connected Layer (FC Layer) of DNNs, at a significantly reduced computation cost (200$\sim$20,000 times faster). Our empirical studies show that using the approximated gradient norm as one of the hyper-parameter search objectives can select models with lower generalization error, but the efficiency is still low (marginal accuracy improvement at a high computation overhead). Our results also show that bandit-based or population-based algorithms, such as BOHB, perform worse with gradient norm objectives, since the correlation between gradient norm and generalization error is not always consistent across phases of the training process.
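To illustrate why restricting the gradient to the FC layer is cheap, here is a minimal sketch for a single sample. It assumes a softmax cross-entropy loss and a final linear layer; under those assumptions the weight gradient is the outer product of `softmax(logits) - one_hot(label)` with the layer's input features, so its norm needs only one forward pass. The function names are hypothetical, and the exact definition used in the paper (averaging over samples, squaring, inclusion of the bias) may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fc_grad_norm(features, logits, label):
    """L2 norm of the cross-entropy gradient w.r.t. the last FC layer
    (weights and bias) for one sample. Hypothetical helper, not the
    authors' implementation."""
    probs = softmax(logits)
    # dL/dlogits = softmax(logits) - one_hot(label)
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    delta_sq = sum(d * d for d in delta)
    feat_sq = sum(x * x for x in features)
    # Weight gradient is outer(delta, features), so its squared Frobenius
    # norm is ||delta||^2 * ||features||^2; the bias gradient is delta.
    return math.sqrt(delta_sq * feat_sq + delta_sq)
```

No backward pass through the convolutional layers is needed, which is consistent with the kind of speed-up described above.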


This document explains how to run the files found in the supplementary materials.

## Dependencies
torch                     1.3.0 

torchvision               0.4.1

paddlepaddle-gpu          1.8.0.post107

paddlehub                 1.4.3

configspace               0.4.12

hpbandster                0.7.4

netifaces                 0.10.9

pyro4                     4.79

## Results and Command to Reproduce
The main results are reported in the paper (pages 7 and 8).

The following examples are based on the ResNet-20 architecture and the CIFAR-10 dataset. To reproduce other results, change the arguments to the desired architecture or dataset.

### Using Blackbox Optimization

To run blackbox optimization (CMA-ES) for hyper-parameter search, use the following command:
```
hub autofinetune agn_blackbox_opt.py --param_file=hparam.yaml --gpu=0,1,2,3,4,5,6,7 --popsize=8 --round=4 --output_dir=resnet20_cifar10_cmaes_alpha0.05 --evaluator=fulltrail --tuning_strategy=hazero epochs 160 alpha 0.05 arch resnet20_cifar cifar-type 10
```

To run blackbox optimization (PSO) for hyper-parameter search, use the following command:
```
hub autofinetune agn_blackbox_opt.py --param_file=hparam.yaml --gpu=0,1,2,3,4,5,6,7 --popsize=8 --round=4 --output_dir=resnet20_cifar10_pso_alpha0.05 --evaluator=fulltrail --tuning_strategy=pshe2 epochs 160 alpha 0.05 arch resnet20_cifar cifar-type 10
```

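To change the criterion used in the hyper-parameter search, modify line 349 of the script used in the commands above (`agn_blackbox_opt.py`). For example, compute the training loss beforehand and change the criterion to `-(training loss) - alpha * AGN`, where AGN is the approximated gradient norm.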
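As a rough illustration of the example criterion, the sketch below combines a loss value and an AGN value into a single score. The function name and signature are hypothetical and are not taken from `agn_blackbox_opt.py`; the sketch also assumes that the search maximizes the score, which is why both terms are negated.

```python
def search_score(loss, agn, alpha=0.05):
    """Hypothetical search criterion: higher is better.

    loss  -- training (or validation) loss of the candidate model
    agn   -- approximated gradient norm of the candidate model
    alpha -- weight of the gradient-norm penalty (0.05 in the commands above)
    """
    return -loss - alpha * agn


# Two candidates with equal loss: the one with the smaller AGN scores higher.
candidate_a = search_score(loss=0.30, agn=4.0)
candidate_b = search_score(loss=0.30, agn=1.0)
best = "b" if candidate_b > candidate_a else "a"
```

With `alpha = 0` the criterion reduces to the negated loss alone, so `alpha` controls how strongly the gradient norm influences model selection.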

### Using BOHB Optimization

To run BOHB for hyper-parameter search, use the following commands:
```
python bohb_main.py --mode valid_loss_agn_tuple --shared_directory resnet20_cifar10_bohb_alpha0.2 --max_budget 160 --min_budget 40 --eta 2 --n_iterations 12 --alpha 0.2 --arch resnet20_cifar --dataset cifar10

python bohb_main.py --worker --mode valid_loss_agn_tuple --shared_directory resnet20_cifar10_bohb_alpha0.2 --max_budget 160 --min_budget 40 --eta 2 --n_iterations 12 --alpha 0.2 --arch resnet20_cifar --dataset cifar10
```
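The budget flags in the commands above define a geometric ladder of training budgets. The sketch below shows how `--min_budget`, `--max_budget`, and `--eta` translate into that ladder in Hyperband-style methods. It is a simplified illustration rather than HpBandSter's own code, the function name is hypothetical, and interpreting a budget as a number of training epochs is an assumption.

```python
def successive_halving_budgets(min_budget, max_budget, eta):
    """Return the geometric budget ladder, smallest budget first.

    Starting from max_budget, keep dividing by eta while the result
    stays at or above min_budget.
    """
    budgets = [float(max_budget)]
    while budgets[-1] / eta >= min_budget:
        budgets.append(budgets[-1] / eta)
    return budgets[::-1]


# Flags used in the commands above: --min_budget 40 --max_budget 160 --eta 2
ladder = successive_halving_budgets(40, 160, 2)
print(ladder)  # [40.0, 80.0, 160.0]
```

Under this reading, most configurations are evaluated at the smallest budget and only the best-ranked ones are promoted to larger budgets. This ranking at early budgets is where an inconsistent correlation between gradient norm and generalization error across training phases can hurt.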

The first command starts the search and the second (`--worker`) command starts a worker. Run the second command once for each additional worker to parallelize the search across multiple GPUs.
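One possible way to start several workers is sketched below. It assumes one worker per GPU and pins each worker to a device through the standard `CUDA_VISIBLE_DEVICES` environment variable. How `bohb_main.py` itself selects a GPU is not documented here, and the helper names are hypothetical.

```python
import os
import shlex
import subprocess

# Worker command copied from the BOHB example above.
WORKER_CMD = (
    "python bohb_main.py --worker --mode valid_loss_agn_tuple "
    "--shared_directory resnet20_cifar10_bohb_alpha0.2 "
    "--max_budget 160 --min_budget 40 --eta 2 --n_iterations 12 "
    "--alpha 0.2 --arch resnet20_cifar --dataset cifar10"
)


def worker_launch_plan(gpu_ids, base_env=None):
    """Build one (argv, env) pair per GPU without starting anything."""
    base_env = dict(os.environ if base_env is None else base_env)
    plan = []
    for gpu in gpu_ids:
        env = dict(base_env)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # assumption: one GPU per worker
        plan.append((shlex.split(WORKER_CMD), env))
    return plan


def launch_workers(gpu_ids):
    """Start all workers and wait for them to finish."""
    procs = [subprocess.Popen(argv, env=env)
             for argv, env in worker_launch_plan(gpu_ids)]
    return [p.wait() for p in procs]
```

For example, `launch_workers([0, 1, 2, 3])` would start four workers, assuming the first command is already running and all processes share the same `--shared_directory`.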

### Evaluation on Full Dataset

Running the above commands produces a set of hyper-parameters found by searching on the reduced dataset.
One may then retrieve these hyper-parameters from the logs and train a model on the full dataset to test its performance.
The file used to evaluate the performance of the searched hyper-parameters is `train_noval_main.py`.



## Acknowledgement
The code using BOHB is adapted from [this GitHub repository](https://github.com/automl/HpBandSter).

