Each folder contains attribution-map visualizations for a different selection of samples, drawn from the following configurations:
1. Imagenette, ResNet50, VanillaGradient
2. Imagenette, ViT-B/16, Attention Map of the last layer

Each folder name contains:
1. bottom/top: whether the folder holds the k samples with the lowest (bottom) or the highest (top) values of the quantity named next to it, sorted in descending order.
2. entropy/distance: the quantity used for sorting. Entropy is the attribution entropy, which is defined for attention-based attributions. Distance is the L2 distance between the attacked attribution and the initial attribution.
3. group/instance normalized: the normalization strategy used for visualization. For visual clarity, heatmaps are usually normalized to [0, 1]. This can be done separately for each image (instance normalized) or over a group of samples with fixed min and max values (group normalized).
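
The quantities in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the code that produced the folders: the function names are hypothetical, the entropy is assumed to be the Shannon entropy of the map rescaled to sum to one, and a group is assumed to be a stack of maps with shape (N, H, W).

```python
import numpy as np


def attribution_entropy(attribution, eps=1e-12):
    """Shannon entropy of a non-negative map (assumed definition)."""
    p = np.asarray(attribution, dtype=np.float64).ravel()
    p = p / (p.sum() + eps)  # rescale so the map sums to one
    return float(-(p * np.log(p + eps)).sum())


def attribution_distance(attacked, initial):
    """L2 distance between the attacked and the initial attribution."""
    diff = np.asarray(attacked, dtype=np.float64) - np.asarray(initial, dtype=np.float64)
    return float(np.linalg.norm(diff.ravel()))


def top_and_bottom_k(scores, k):
    """Indices of the k highest and k lowest scores, in descending order."""
    order = np.argsort(scores)[::-1]
    return order[:k], order[-k:]


def instance_normalize(maps, eps=1e-12):
    """Scale every map to [0, 1] using its own min and max."""
    maps = np.asarray(maps, dtype=np.float64)
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + eps)


def group_normalize(maps, eps=1e-12):
    """Scale all maps to [0, 1] using one min and max shared by the group."""
    maps = np.asarray(maps, dtype=np.float64)
    lo, hi = maps.min(), maps.max()
    return (maps - lo) / (hi - lo + eps)
```

With instance normalization every heatmap uses the full color range, so relative magnitudes between images are lost. With group normalization a map with small values stays dim next to a map with large values, which makes the images in one folder directly comparable.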

Inside each folder, the files are named as follows:
attrib-<activation function>, <learning rate>, <architecture>-<image index>.png

Regarding the activation functions:
1. If the activation function is RELU, softmax attention is used.
2. If the activation function is GELU_GELU, GELU is used for both the MLP and the kernelized attention.

Regarding the learning rates:
1. There is one small learning rate, corresponding to NoR.
2. There is one large learning rate, corresponding to ICR.
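
For scripted access to the images, the naming scheme described above could be parsed roughly like this. It is a sketch under assumptions: the separators (a comma followed by a space, and a hyphen before the image index) are read literally from the pattern, the image index is assumed to be an integer, and the helper name and the example file name are hypothetical. The learning rate is kept as a string because its exact formatting is not specified here.

```python
import re

# Assumed layout: "attrib-<activation>, <learning rate>, <architecture>-<image index>.png"
FILENAME_RE = re.compile(
    r"^attrib-(?P<activation>[^,]+),\s*(?P<learning_rate>[^,]+),\s*"
    r"(?P<architecture>.+)-(?P<image_index>\d+)\.png$"
)

# Meaning of the activation field, as described in this README.
ATTENTION_BY_ACTIVATION = {
    "RELU": "softmax attention",
    "GELU_GELU": "GELU for both the MLP and the kernelized attention",
}


def parse_attribution_filename(name):
    """Split a file name into its fields; raise ValueError if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    fields = match.groupdict()
    fields["image_index"] = int(fields["image_index"])
    fields["attention"] = ATTENTION_BY_ACTIVATION.get(fields["activation"], "unknown")
    return fields


# Hypothetical file name, used only to illustrate the parsing.
info = parse_attribution_filename("attrib-GELU_GELU, 0.001, vit-12.png")
```

If the actual files use different separators, only `FILENAME_RE` needs to change.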
