# Understanding Memorization with Sample Gradients: Efficient, Theoretical, Practical

This repository contains the code and scripts to reproduce the experiments described in the paper _"Understanding Memorization with Sample Gradients: Efficient, Theoretical, Practical"_. Follow the instructions below to set up and run the various experiments.

## Setup

To properly set up your environment, you will need to:
1. Install necessary dependencies.
2. Deploy a MinIO Docker container for storage.
3. Configure your environment with the correct `config.json` and credentials files.
4. Download and place the **FZ Influence Matrix checkpoints**.

### Step 1: Install Dependencies

Ensure that your Python environment includes the necessary dependencies. Specifically, TensorFlow, PyTorch, and `minio` must all be installed.
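
As a quick sanity check, the sketch below reports which of these packages cannot be found in the active environment. The import names (`tensorflow`, `torch`, `minio`) are assumptions based on the usual PyPI packages; adjust them if your installation differs.

```python
import importlib.util

# Import names are assumptions: TensorFlow imports as "tensorflow",
# PyTorch as "torch", and the MinIO Python client as "minio".
REQUIRED_MODULES = ("tensorflow", "torch", "minio")


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

The check only locates the packages without importing them, so it runs quickly even when the deep learning frameworks are installed.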


### Step 2: Deploy MinIO Docker Container

MinIO will act as your object storage service. To deploy a MinIO Docker container locally or on a remote server, follow these steps:

1. **Pull the MinIO Docker image**:
    ```bash
    docker pull minio/minio
    ```

2. **Run the MinIO container**:
    ```bash
    docker run -p 9000:9000 -p 9001:9001 --name minio \
        -e "MINIO_ROOT_USER=<your-access-key>" \
        -e "MINIO_ROOT_PASSWORD=<your-secret-key>" \
        minio/minio server /data --console-address ":9001"
    ```

    Replace `<your-access-key>` and `<your-secret-key>` with your desired credentials. This command runs MinIO on ports `9000` (for S3 access) and `9001` (for the MinIO console).

3. **Access MinIO Console**: Once the container is running, you can access the MinIO web console by navigating to:
    ```
    http://<your-server-ip>:9001
    ```

    Log in using the access and secret keys you provided when running the container.

### Step 3: Create Configuration Files

You need to create two configuration files: `config.json` and `credentials.json`. 

#### **config.json**

Create a `config.json` file in the root directory of the project. This file should look like the following:

```json
{
    "log_dir": "./logs",
    "seeds_dir": "./seeds",
    "data_dir": "/local/path/to/your/data/directory/",
    "imagenet_dir": "/local/path/to/imagenet/",
    "model_save_dir": "./pretrained/"
}
```

- Update the `"data_dir"` field to point to your local data directory where the datasets are stored, and the `"imagenet_dir"` field to point to your local copy of ImageNet.
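
The following is a minimal sketch for checking the file before launching any job. It assumes the five keys from the example above are all required; the helper names (`validate_config`, `load_config`, `missing_directories`) are hypothetical and not part of the repository.

```python
import json
from pathlib import Path

# Keys taken from the example config.json; treating all of them as
# required is an assumption.
EXPECTED_KEYS = ("log_dir", "seeds_dir", "data_dir", "imagenet_dir", "model_save_dir")


def validate_config(config):
    """Raise KeyError if any expected key is absent, else return the config."""
    missing = [key for key in EXPECTED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return config


def load_config(path="config.json"):
    """Read and validate the configuration file."""
    with open(path, encoding="utf-8") as f:
        return validate_config(json.load(f))


def missing_directories(config):
    """Return the keys whose configured directory does not exist yet."""
    return [key for key in EXPECTED_KEYS if not Path(config[key]).is_dir()]
```

Calling `missing_directories(load_config())` from the project root lists any configured paths that still need to be created or corrected.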

#### **credentials.json**

Create a `credentials.json` file to store your MinIO credentials. This file should look like the following:

```json
{
    "url": "http://<minio-server-ip>:9001/api/v1/service-account-credentials",
    "endpoint": "<minio-server-ip>:9000",
    "accessKey": "<your-access-key>",
    "secretKey": "<your-secret-key>",
    "api": "s3v4",
    "path": "auto"
}
```

- Replace `<minio-server-ip>` with the IP address or domain where your MinIO service is hosted.
- Replace `<your-access-key>` and `<your-secret-key>` with the same values you used when setting up the MinIO Docker container.
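
For illustration, the fields above could be mapped onto the MinIO Python client as sketched below. This is not necessarily how the repository reads the file. The helper names are hypothetical, and inferring TLS from the scheme of `"url"` is an assumption.

```python
import json
from urllib.parse import urlparse


def minio_client_kwargs(credentials):
    """Translate a credentials.json dict into keyword arguments for minio.Minio."""
    # Assumption: the scheme of "url" tells us whether the server uses TLS.
    secure = urlparse(credentials["url"]).scheme == "https"
    return {
        "endpoint": credentials["endpoint"],
        "access_key": credentials["accessKey"],
        "secret_key": credentials["secretKey"],
        "secure": secure,
    }


def make_client(path="credentials.json"):
    """Build a MinIO client from credentials.json (requires the minio package)."""
    from minio import Minio  # imported lazily so the helper above works without it

    with open(path, encoding="utf-8") as f:
        credentials = json.load(f)
    return Minio(**minio_client_kwargs(credentials))
```

With the container running, `make_client().list_buckets()` should return without raising if the endpoint and keys are correct.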

### Step 4: Download FZ Influence Matrix Checkpoints

You will need to download the **FZ Influence Matrix checkpoints** and place them in the following directory:

```bash
analysis_checkpoints/dataset
```

Make sure the checkpoints are correctly placed in this directory for the analysis to work properly.
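
A small sketch for confirming that the directory is populated is shown below. The file names and layout of the checkpoints are not specified here, so the helper (a hypothetical name) simply lists whatever files it finds.

```python
from pathlib import Path

# Directory named in this step; the file layout inside it is not specified.
CHECKPOINT_DIR = Path("analysis_checkpoints/dataset")


def list_checkpoint_files(directory=CHECKPOINT_DIR):
    """Return the sorted files under the checkpoint directory (empty if absent)."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.rglob("*") if p.is_file())
```

An empty result from `list_checkpoint_files()` means the directory is missing or contains no files.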

### Step 5: Verify the Setup

Ensure that your environment is configured correctly and that you have access to MinIO. You can use the MinIO console to verify that you can upload and access files from your data directory.
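
In addition to the console, reachability of the S3 port can be probed from Python. The sketch below queries MinIO's liveness endpoint (`/minio/health/live`). The function names are hypothetical, and the endpoint value is assumed to be the same `host:port` string used in `credentials.json`.

```python
import urllib.error
import urllib.request


def health_url(endpoint, secure=False):
    """Build the liveness-probe URL for a MinIO endpoint given as host:port."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{endpoint}/minio/health/live"


def minio_is_live(endpoint, secure=False, timeout=5.0):
    """Return True if the server answers the liveness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(endpoint, secure), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

For example, `minio_is_live("<minio-server-ip>:9000")` should return `True` once the container is up. The probe only confirms that the server is running and does not check the credentials.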

## Experiments
This section outlines the steps to reproduce the experiments from the paper. All of the experiments below require trained models and scores computed at each checkpoint, which can be produced as follows:\
a. Use `train_cifar100.py` or `train_imagenet.py` to train models for CIFAR-100 and ImageNet, respectively. \
b. Run `score_cifar100.py` or `score_imagenet.py` to score the CIFAR-100 or ImageNet models trained in step (a) using CSG and the other baselines.
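
Steps (a) and (b) can be chained with a small driver such as the sketch below. The command-line arguments accepted by the scripts are not documented here, so none are passed. That is an assumption, and the helper names are hypothetical.

```python
import subprocess
import sys

# Script names come from steps (a) and (b); no arguments are passed because
# the scripts' command-line options are not documented in this README.
PIPELINES = {
    "cifar100": ("train_cifar100.py", "score_cifar100.py"),
    "imagenet": ("train_imagenet.py", "score_imagenet.py"),
}


def build_commands(dataset):
    """Return the train and score commands, in order, for one dataset."""
    train_script, score_script = PIPELINES[dataset]
    return [[sys.executable, train_script], [sys.executable, score_script]]


def run_pipeline(dataset):
    """Run training, then scoring, stopping at the first failing step."""
    for command in build_commands(dataset):
        subprocess.run(command, check=True)
```

Running `run_pipeline("cifar100")` from the repository root would train and then score the CIFAR-100 models. Scoring is skipped if training exits with an error.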

### 1. Validating Theory
c. Run `sim_exps/mem_sim_cifar100.ipynb` and `sim_exps/mem_sim_imagenet.ipynb` to reproduce the results for CIFAR-100 and ImageNet, respectively.

### 2. Similarity with Memorization
c. Run `sim_exps/mem_sim_cifar100.ipynb` and `sim_exps/mem_sim_imagenet.ipynb` to reproduce the results for similarity with memorization.

### 3. Early Stopping
c. Run `early_stopping_exps/early_stopping.ipynb` to reproduce the early stopping results.

### 4. Mislabelled Experiment

The mislabelled experiment involves training multiple models (mislabelled, k-fold confident learning, and SSFT models) and then scoring them. To run all the training and scoring jobs for the mislabelled experiments, execute the following script:

```bash
sh ./mislabeled_exps/mislabeled_exps.sh
```

This script will:
- Train the mislabelled models, k-fold confident learning models, and SSFT models for both CIFAR-10 and CIFAR-100 datasets, across different noise levels and seeds.
- Score the models for each epoch to compute baseline metrics and learning time.


After running the mislabelled experiments, analyze the results using the `analyze_mislabeled/analyze_mislabeled.ipynb` notebook. This notebook will compute relevant metrics and generate visualizations as described in the paper. Before running it, move the notebook to the root folder, i.e. the same location as `config.json` and `credentials.json`.


### 5. Duplicate Detection Experiment

The duplicate detection experiment involves training SSFT models, confident learning models, and standard models for the CIFAR-10 and CIFAR-100 duplicate datasets, followed by scoring them using various metrics.

To run all the training and scoring jobs for the duplicate detection experiments, execute the following script:

```bash
sh ./duplicate_exps/duplicate_exps.sh
```

This script will:
- Train SSFT models with different seeds and parts for both CIFAR-10 and CIFAR-100 duplicate datasets.
- Train standard models for the duplicate datasets.
- Train confident learning models for both CIFAR-10 and CIFAR-100 duplicate datasets.
- Score the models and compute metrics such as loss curvature, gradient, and learning time for both datasets.

Once the duplicate detection experiment has completed, analyze the results using the `duplicate_exps/analyze_duplicates.ipynb` notebook. This notebook will compute and visualize the relevant metrics as described in the paper. Before running it, move the notebook to the root folder, i.e. the same location as `config.json` and `credentials.json`.
