## Explanation of the code and experiments:

This document outlines how to run each script to reproduce the intended results.

### Data:
In **create_data_loader.py**, we use five types of datasets to produce results.
1. ***linear subspace model:***

The linear subspace model is a synthetic dataset that can generate samples for several scenarios: sample noise, feature noise, domain shift, and anomalies. The 'gaussian_data_loader' function in the script creates this dataset, and its 'scenario' parameter selects the scenario. All other parameters, detailed in the paper, are either hyperparameters or factors that control the noise.

2. ***Single-cell RNA data:***

The 'single_cell_data_loader' function produces real-world single-cell samples for the sample noise, feature noise, and domain shift scenarios. Download the dataset from https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data/dataset4 and place it in the 'data' folder.

3. ***CelebA:***

The 'anomaly_data_loader_celeba' function generates real-world samples specifically for the anomaly detection experiment. This dataset is already located within the 'data' folder.

4. ***Non-linear subspace data (Appendix):***

This synthetic dataset differs from the linear subspace data in that it is non-linear. It is created by the 'gaussian_data_loader_non_linear' function.

5. ***MNIST (Appendix):***

The MNIST and MNIST-M datasets ('mnist_data_loader') are used to train the CNN models.
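
To make the first item concrete, the idea behind a linear subspace dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's 'gaussian_data_loader': the function name `make_linear_subspace_samples`, its parameters, and the additive Gaussian noise are hypothetical, and only the Python standard library is used so that the sketch has no dependencies.

```python
import random


def make_linear_subspace_samples(n_samples, n_features, latent_dim,
                                 noise_std=0.0, seed=0):
    """Draw samples that lie near a latent_dim-dimensional linear subspace.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    rng = random.Random(seed)
    # Fixed random linear map from the latent space to the feature space.
    projection = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
                  for _ in range(n_features)]
    samples = []
    for _ in range(n_samples):
        # Latent Gaussian vector for this sample.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Project to feature space and add independent Gaussian noise per feature.
        x = [sum(w * zi for w, zi in zip(row, z)) + rng.gauss(0.0, noise_std)
             for row in projection]
        samples.append(x)
    return samples
```

With `noise_std=0.0` every sample lies exactly in the subspace spanned by the columns of the projection; increasing `noise_std` moves samples off it. How the actual loader injects sample noise, feature noise, domain shift, or anomalies is described in the paper.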

### Models:
**MLP:** We use the same MLP autoencoder for all scenarios and datasets except MNIST. It has a single hidden layer in both the encoder and the decoder, with a bottleneck layer in between. The number of hidden layers is configurable, but all experiments in the paper use one hidden layer in each of the encoder and decoder. The model is defined in **AutoEncoder_models.py** as the class 'MLPAE'.

**CNN (Appendix):** We train CNNs with three layers in both the encoder and the decoder, and a bottleneck layer in between, on the MNIST and MNIST-M datasets to show that double descent is present in different model architectures. The model is defined in **AutoEncoder_models.py** as the class 'CNNAE'.

### Hyper-parameters and other parameters:
Within the **config.yaml** file, you'll find all the hyperparameters for training, along with other parameters necessary for evaluating the desired outcomes. We employ the omegaconf package to extract values from the config.yaml file.

### Training:
The **train_test.py** script handles all training carried out in this work.

### Results:
The **model_epoch_wise_gaussian.py** script generates the model-wise and epoch-wise double descent results. It uses the linear subspace data, and its 'scenario' parameter selects among all scenarios: sample noise, feature noise, domain shift, and anomalies.

Similarly, the **model_epoch_wise_cells.py**, **model_epoch_wise_non_linear_subspace.py**, and **model_epoch_wise_mnist.py** scripts use their respective datasets to generate model-wise and epoch-wise double descent results for the sample noise, feature noise, and domain shift scenarios.

The **model_epoch_wise_anomalies.py** script, in contrast, uses the CelebA dataset to explore model-wise (and epoch-wise) double descent in the context of anomalies.

These scripts are largely similar, but they are kept as separate files to accommodate the different datasets and scenarios, each of which requires small variations. Each script begins with its imports and then reads parameters from the config.yaml file. When you specify an experiment_name, the code saves all resulting CSV files in a folder with that name under the 'results' directory.
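
The output layout described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper `save_results`, the default file name, and the column names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def save_results(experiment_name, rows, filename="results.csv", root="results"):
    """Write rows (a list of dicts) to <root>/<experiment_name>/<filename>.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    out_dir = Path(root) / experiment_name
    # Create results/<experiment_name>/ if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", newline="") as f:
        # Column names are taken from the keys of the first row.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

For example, `save_results("my_experiment", [{"hidden_size": 4, "test_loss": 0.5}])` would write 'results/my_experiment/results.csv', which a plotting notebook could then read back from the same path.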

Lastly, the **sample_wise_gaussian.py**, **sample_wise_mnist.py**, and **sample_wise_non_linear_subspace.py** scripts produce the sample-wise double descent results.

### Plots:

The **plots.ipynb** notebook includes all figures presented in the paper. To view the results, ensure that you have saved them in the designated paths expected by each function. Alternatively, you can modify the code to point to the location where you have saved the results.

The **check_latent_domain_shift.ipynb** notebook generates the results detailed in the Appendix titled 'Domain Adaptation Based on Model Size'. Before displaying the results, you'll need to train the models, which will then be saved along with the training and testing data in the 'domain shift models' folder.

The **mnist_images_check.ipynb** notebook shows how models of different sizes reconstruct noisy MNIST images, as presented in the Appendix. To check the results, first train the models and save their parameters in the 'mnist_sample_noise_models' folder.

Each generated plot can optionally be saved. The default save path is the 'image' folder, which contains a subfolder for each scenario (sample noise, feature noise, domain shift, and anomalies).