VERSIONS:
We ran the experiments on a Tesla T4 GPU.
The implementation is based on Python 3.10 and the following packages: torch 2.4.0+cu121, transformers 4.42.4, scikit-learn 1.3.2, scipy 1.13.1, pandas 2.1.4, numpy 1.26.4, ATC (https://github.com/saurabhgarg1996/ATC_code), Mandoline (https://github.com/HazyResearch/mandoline), typo (https://github.com/ranvijaykumar/typo), matplotlib 3.7.1, and seaborn 0.13.1.
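
To compare an environment against the pinned versions above, a minimal sketch along the following lines may help. It is not part of the repository. The helper name `check_versions` is hypothetical, and it covers only the packages listed with a version number (ATC, Mandoline, and typo are taken from GitHub and are not checked). It assumes that the installed distribution metadata reports the version string exactly as written above (e.g. "2.4.0+cu121" for torch).

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the list above.
PINNED = {
    "torch": "2.4.0+cu121",
    "transformers": "4.42.4",
    "scikit-learn": "1.3.2",
    "scipy": "1.13.1",
    "pandas": "2.1.4",
    "numpy": "1.26.4",
    "matplotlib": "3.7.1",
    "seaborn": "0.13.1",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package matches its pinned version.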

##########
NOTEBOOKS AND PY FILES:

- ATC_helper.py and mandoline.py: extracts from the ATC baseline code (https://github.com/saurabhgarg1996/ATC_code) and the Mandoline baseline code (https://github.com/HazyResearch/mandoline), respectively.

- settings.py: this file contains all the settings by dataset (e.g. file name, list of features, number of classes) and for the pre-trained model.

- dataset.py: this file contains all the functions for pre-processing the data.

- model.py: this file contains all the classes defining the classification models (TTT, baselines, ablation studies).

- optimization.py: this file contains all the functions for fine-tuning and evaluating the pre-trained model.

- fine-tuning.ipynb: this main notebook runs the fine-tuning of the pre-trained classification models. The checkpoints are saved in the 'trained_models/' directory.

- shift_generate.py: this file contains all the functions used to generate distribution shifts.

- shift_evaluate.py: this file contains all the functions related to EC and the baselines.

- shift_explain.py: this file contains all the functions related to the explanation algorithm.

- run_tests.ipynb: this main notebook runs EC, the baselines, and the ablation studies. The model checkpoints are loaded from the 'trained_models/' directory.

- uncertainty.ipynb: this main notebook runs the explanation algorithm and measures the explanation quality.

##########
DATASETS:
The datasets are loaded from a folder called "datasets".
For some of the use cases, we use only the original training dataset, because the original test dataset does not include the true labels (competition data). In that case, we treat the training dataset as the modeling data, which is then randomly split into training, validation, and test subsets.
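
The random split described above can be illustrated with the sketch below. It is not the code from dataset.py: the function name `split_indices`, the split fractions, and the seed are all assumptions chosen for illustration, since the actual proportions are not stated here.

```python
import random


def split_indices(n_rows, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists.

    The fractions and the seed are illustrative assumptions, not the
    values used in the experiments.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test
```

The returned index lists can be passed to, for example, `DataFrame.iloc` in pandas to select the rows of each subset.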
airbnb: "cleansed_listings_dec18.csv"
cloth: "Womens Clothing E-Commerce Reviews.csv"
kick: "kickstarter_train.csv" ("train.csv" is the name of the original dataset)
petfinder: "petfinder_train.csv" ("train.csv" is the name of the original dataset)
salary: "Data_Scientist_Salary_Train.csv" ("Final_Train_Dataset.csv" is the name of the original dataset)
wine10 & wine100: "winemag-data-130k-v2.csv"
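
Before running the notebooks, it may be convenient to verify that the files listed above are present. The following sketch is not part of the repository, and the helper name `missing_datasets` is hypothetical. It assumes the "datasets" folder sits in the current working directory and that the files are stored directly inside it.

```python
from pathlib import Path

# Use case -> expected file name, copied from the list above.
EXPECTED_FILES = {
    "airbnb": "cleansed_listings_dec18.csv",
    "cloth": "Womens Clothing E-Commerce Reviews.csv",
    "kick": "kickstarter_train.csv",
    "petfinder": "petfinder_train.csv",
    "salary": "Data_Scientist_Salary_Train.csv",
    "wine10": "winemag-data-130k-v2.csv",
    "wine100": "winemag-data-130k-v2.csv",
}


def missing_datasets(root="datasets"):
    """Return {use_case: file_name} for every expected file not found in root."""
    root = Path(root)
    return {
        use_case: file_name
        for use_case, file_name in EXPECTED_FILES.items()
        if not (root / file_name).is_file()
    }


if __name__ == "__main__":
    for use_case, file_name in missing_datasets().items():
        print(f"{use_case}: missing {file_name}")
```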
