# NOTES

- The full posterior covariance is computed by the formula at this line of the edward2 code:
  https://github.com/google/edward2/blob/d2571a25bd4ed4a4575f4f64f5048b5e1a8bf233/edward2/tensorflow/layers/random_feature.py#L411

- The SNGP authors used a lower dropout rate and a different layer ordering than the original WRN implementation
  (example of the original implementation: https://github.com/meliketoy/wide-resnet.pytorch/blob/master/networks/wide_resnet.py)
  - the original code uses the ordering batchnorm --> relu --> conv --> dropout, with only one dropout layer per residual block
  - the SNGP code seems to put dropout layers between all of the convs: https://github.com/google/uncertainty-baselines/blob/1c0b2f9353593eb7f54b553ab35f113614d7de5c/uncertainty_baselines/models/wide_resnet_sngp.py
  - SNGP also uses the ordering batchnorm --> relu --> dropout --> conv

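The two orderings described above can be sketched in PyTorch as follows. This is a minimal illustration, not the code from either repository: the helper names `wrn_branch`, `sngp_branch`, and `layer_order` are hypothetical, the dropout rates are placeholders (the notes only say that the SNGP rate is lower), and spectral normalization and the shortcut connection are left out.

```python
import torch
from torch import nn


def wrn_branch(channels, p_drop=0.3):
    """Residual branch in the original WRN order, with a single dropout layer."""
    return nn.Sequential(
        nn.BatchNorm2d(channels), nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.Dropout(p_drop),  # the only dropout in the block, placed after the first conv
        nn.BatchNorm2d(channels), nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )


def sngp_branch(channels, p_drop=0.1):
    """Residual branch in the order the notes describe for SNGP: dropout before each conv."""
    return nn.Sequential(
        nn.BatchNorm2d(channels), nn.ReLU(), nn.Dropout(p_drop),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels), nn.ReLU(), nn.Dropout(p_drop),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )


def layer_order(branch):
    """Return the layer class names in execution order."""
    return [type(layer).__name__ for layer in branch]


# Usage: the residual output is the input plus the branch output.
x = torch.randn(2, 16, 8, 8)
out = x + sngp_branch(16)(x)
```

Calling `layer_order` on each branch makes the difference easy to compare side by side.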

- The Hessian of the negative log-likelihood w.r.t. Beta in the multiclass case would be p(1-p) Phi Phi^T, as stated in
  the paper and as can be seen in problem 6 here: https://www.overleaf.com/project/60ebad6389ab3e5a46c2dd57
  However, the code does not actually do that in practice: https://github.com/google/edward2/blob/main/edward2/tensorflow/layers/random_feature.py#L371
  I think this is because the actual GP posterior covariance they are trying to compute is of the form K*^T (K + lambda I)^-1 K*, where
  there are none of the preceding Hessian multipliers.

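The two precision matrices contrasted above can be written out as a sketch. The notation is an assumption: Phi is the N x D matrix whose rows are the random-feature vectors Phi_i of the training points, p_{i,k} is the softmax probability of class k for point i, lambda is the ridge penalty, and the identity prior term in (1) is assumed. Form (2) is only a reading of the note above, not a quote from the linked code. Identity (3) is the standard Woodbury identity, which shows why the unweighted form corresponds to a GP posterior with a linear kernel on the random features.

```latex
% (1) Laplace precision with the Hessian multipliers (per class k):
\hat{\Sigma}_k^{-1} = I + \sum_{i=1}^{N} p_{i,k}\,(1 - p_{i,k})\,\Phi_i \Phi_i^{\top}

% (2) Unweighted precision, without the Hessian multipliers:
\hat{\Sigma}^{-1} = \lambda I + \Phi^{\top} \Phi

% (3) Woodbury identity, with K = \Phi \Phi^{\top}, K_* = \Phi_* \Phi^{\top}, K_{**} = \Phi_* \Phi_*^{\top}:
\lambda\, \Phi_* \left( \lambda I + \Phi^{\top} \Phi \right)^{-1} \Phi_*^{\top}
  = K_{**} - K_* \left( K + \lambda I \right)^{-1} K_*^{\top}
```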
- The GP layer in the original code has an option for a bias, but it is set to 0: https://github.com/google/uncertainty-baselines/blob/main/baselines/cifar/sngp.py#L132

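To make the role of this bias concrete, here is a minimal NumPy sketch of a random-feature GP output layer. It assumes the bias in question is a constant added to the output logits, which the notes do not confirm. The function name `gp_layer_logits` and all shapes are hypothetical.

```python
import numpy as np


def gp_layer_logits(h, W, b, beta, output_bias=0.0):
    """Random-feature GP output layer with a constant output bias."""
    num_features = W.shape[0]
    # Random Fourier features of the hidden representation h.
    phi = np.sqrt(2.0 / num_features) * np.cos(h @ W.T + b)
    # Linear readout plus a constant offset shared by all classes.
    return phi @ beta + output_bias


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))                # batch of hidden representations
W = rng.normal(size=(64, 16))               # fixed random projection
b = rng.uniform(0.0, 2.0 * np.pi, size=64)  # fixed random phases
beta = rng.normal(size=(64, 10))            # stand-in for the learned output weights
logits = gp_layer_logits(h, W, b, beta)     # output_bias left at 0.0
```

With `output_bias=0.0` the logits are exactly the linear readout of the random features.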
- The official code sets the ridge penalty to 1 and updates the covariance matrix once, even though the paper says their
  experiments use the iterative update method. This may have been an afterthought in designing it, but I chose to follow
  their code exactly.

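A minimal NumPy sketch of the single-pass covariance computation described above, assuming the unweighted precision matrix `ridge_penalty * I + Phi^T Phi`. All function names are hypothetical. The predictive covariance is scaled by `ridge_penalty` so that it matches the kernel-space GP form. `moving_average_update` is only one plausible form of an iterative update and is an assumption, not taken from the paper or the code.

```python
import numpy as np


def fit_covariance(phi_train, ridge_penalty=1.0, batch_size=32):
    """Accumulate the precision matrix over one pass, then invert it once."""
    num_features = phi_train.shape[1]
    precision = ridge_penalty * np.eye(num_features)
    for start in range(0, len(phi_train), batch_size):
        batch = phi_train[start:start + batch_size]
        precision += batch.T @ batch  # unweighted: no p(1-p) multipliers
    return np.linalg.inv(precision)


def moving_average_update(precision, batch, momentum=0.999):
    """One step of an assumed moving-average (iterative) precision update."""
    return momentum * precision + (1.0 - momentum) * (batch.T @ batch)


def predictive_covariance(phi_test, covariance, ridge_penalty=1.0):
    """Covariance of the outputs at the test features."""
    return ridge_penalty * phi_test @ covariance @ phi_test.T


# Usage with random stand-in features (100 training points, 8 random features).
rng = np.random.default_rng(0)
phi_train = rng.normal(size=(100, 8))
phi_test = rng.normal(size=(5, 8))
covariance = fit_covariance(phi_train, ridge_penalty=1.0)
test_cov = predictive_covariance(phi_test, covariance, ridge_penalty=1.0)
```

Inverting once at the end keeps the cost at a single D x D inversion, whereas an iterative scheme would maintain the running precision throughout training.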
- In the toy experiments, I have found that the ridge penalty is very task dependent. On the three toy tasks, I had to
  set it to very different values in order to get good qualitative results.
