# Overparameterized Ensembles

This repository contains the code for our paper [title]

## Table of Contents

- [Overparameterized Ensembles](#overparameterized-ensembles)
  - [Table of Contents](#table-of-contents)
  - [Installation and setup](#installation-and-setup)
  - [Usage](#usage)
  - [Reproducing the figures from the main paper](#reproducing-the-figures-from-the-main-paper)
    - [Figure 1](#figure-1)
    - [Figure 2](#figure-2)
    - [Figure 3](#figure-3)
    - [Figure 4](#figure-4)
    - [Figure 5](#figure-5)
    - [Figure 6](#figure-6)
  - [Code structure](#code-structure)
  - [Additional notes](#additional-notes)

## Installation and setup

This project uses [Poetry](https://python-poetry.org/docs/) for dependency management. Follow these steps to set up your environment:

First, [install Poetry](https://python-poetry.org/docs/#installation).

Second, navigate to the project directory (where the `pyproject.toml` file is located) and run:

```bash
poetry shell
poetry install
```

This will create a virtual environment and install all the dependencies listed in `pyproject.toml`.

Now you should be able to run the command-line tools and scripts in this project.
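
In Poetry 2.0 and later, `poetry shell` is no longer part of the core tool and is provided by a separate plugin. As an alternative, `poetry run` executes a single command inside the project's virtual environment without activating a shell. The sketch below is a quick sanity check under that assumption; it uses `visualize-models`, one of the commands listed under Usage, and only prints its help text.

```bash
# Check that Poetry is on PATH, then run one project command through `poetry run`.
if command -v poetry >/dev/null 2>&1; then
  poetry --version
  # `poetry run` executes the command inside the project's virtual environment.
  poetry run visualize-models --help \
    || echo "visualize-models did not run; has 'poetry install' completed?" >&2
  poetry_found=yes
else
  echo "poetry not found on PATH; install it first" >&2
  poetry_found=no
fi
```

If the help text is printed, the environment and the command-line entry points are set up correctly.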

## Usage

We have the following commands available:

```bash
average-difference-vs_num-features
convergence-expected-value-term
generalization-error-decay
lipschitz-difference-infinite-models
variance-vs-number-of-features
variance-vs-points-in-range
visualize-models
```

After activating the virtual environment with `poetry shell`, you can run any of these commands as follows:

```bash
<command-name> <command-parameters>
```

For example, to run the `average-difference-vs_num-features` command, you could run:

```bash
average-difference-vs_num-features --data-generating-function-name "sinusoidal" --num-features-end 25 --num-training-samples 6
```

To see the available parameters for each command, run:

```bash
<command-name> --help
```
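
To skim the options of every tool in one pass, the `--help` call can be looped over the command names listed above. This is a minimal sketch in POSIX shell; it assumes the Poetry environment is active, and reports any command it cannot find instead of failing.

```bash
# Print the help text of every command listed above.
# Commands that are not on PATH are reported and counted, not executed.
total=0
missing=0
for cmd in \
  average-difference-vs_num-features \
  convergence-expected-value-term \
  generalization-error-decay \
  lipschitz-difference-infinite-models \
  variance-vs-number-of-features \
  variance-vs-points-in-range \
  visualize-models
do
  total=$((total + 1))
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "== $cmd =="
    "$cmd" --help
  else
    echo "missing: $cmd (is the Poetry environment active?)" >&2
    missing=$((missing + 1))
  fi
done
echo "checked $total commands, $missing missing"
```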

The results of the commands will be saved in the `results` directory.
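
To check what a run produced, the directory can be listed with the newest entries first. The sketch below assumes that `results` is created relative to the current working directory, which the text above does not specify; `RESULTS_DIR` is a hypothetical variable introduced here so the path can be overridden.

```bash
# RESULTS_DIR is an assumption: adjust it if the results are written elsewhere.
RESULTS_DIR="${RESULTS_DIR:-results}"
if [ -d "$RESULTS_DIR" ]; then
  # Long listing sorted by modification time, newest first (first ten lines).
  ls -lt "$RESULTS_DIR" | head -n 10
  results_state=present
else
  echo "no '$RESULTS_DIR' directory here yet; run one of the commands first" >&2
  results_state=absent
fi
```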

## Reproducing the figures from the main paper

To reproduce the figures from the main paper, you can run the following commands:

### Figure 1

```bash
visualize-models --num-training-samples 6 --num-features-per-model 200 --number-ensemble-members 1 --number-simulations-per-size 100 --random-seed 42 --plot-kernel-model --kernel "arc-cosine-kernel"
```
and
```bash
visualize-models --num-training-samples 6 --num-features-per-model 200 --number-ensemble-members 10000 --number-simulations-per-size 1 --plot-kernel-model --kernel "arc-cosine-kernel"
```

### Figure 2

```bash
convergence-expected-value-term --max-num-models 100000 --data-generating-function-name "sinusoidal" --data-dimension 1 --num-training-samples 6 --num-features-per-model 200 --random-seed 42
```
and
```bash
convergence-expected-value-term --max-num-models 100000 --data-generating-function-name "CaliforniaHousing" --data-dimension 8 --num-training-samples 12 --num-features-per-model 200 --kernel "erf-kernel" --activation-function "erf"
```

### Figure 3

```bash
average-difference-vs_num-features --data-generating-function-name "sinusoidal" --num-training-samples 6 --ridge 0.0 --data-dimension 1
```
and
```bash
average-difference-vs_num-features --data-generating-function-name "CaliforniaHousing" --num-training-samples 12 --ridge 0.0 --data-dimension 8 --kernel "softplus-kernel" --activation-function "softplus"
```

### Figure 4

```bash
variance-vs-points-in-range --number-points-to-test 1000 --data-generating-function-name "sinusoidal" --num-features-per-model 200 --num-training-samples 6 --random-seed 42
```

### Figure 5

```bash
variance-vs-number-of-features --data-generating-function-name "CaliforniaHousing" --num-features-per-model 200 --num-training-samples 12 --data-dimension 8 --random-seed 42 --max-num-models 35
```
and
```bash
generalization-error-decay --data-generating-function-name "CaliforniaHousing" --num-training-samples 12 --num-features-per-model 200 --max-num-models 35 --number-simulations-per-size 2500 --random-seed 42 --data-dimension 8
```

### Figure 6
    
```bash
lipschitz-difference-infinite-models --ridge-step 0.00001 --ridge-end 0.001 --ridge-start 0.0 --max-num-models 2000 --data-generating-function-name "CaliforniaHousing" --data-dimension 8 --num-training-samples 12 --num-features-per-model 200
```
and
```bash
lipschitz-difference-infinite-models --ridge-step 0.00001 --ridge-end 0.001 --ridge-start 0.0 --max-num-models 2000 --data-generating-function-name "CaliforniaHousing" --data-dimension 8 --num-training-samples 12 --num-features-per-model 200 --comparison-mode "ensemble"
```
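
When running several of these commands back to back, it can help to keep each command's console output in its own log file. The sketch below is one way to do that; `run_logged` and the `logs` directory are hypothetical names introduced here, not part of the repository. The Figure 4 command is copied verbatim from above.

```bash
# LOG_DIR is a hypothetical location for console logs, not defined by the repository.
LOG_DIR="${LOG_DIR:-logs}"
mkdir -p "$LOG_DIR"

# run_logged NAME COMMAND [ARGS...]
# Runs COMMAND with ARGS, writes stdout and stderr to $LOG_DIR/NAME.log,
# and returns the command's exit status (127 if the command is not on PATH).
run_logged() {
  name=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "skip $name: $1 not found on PATH" >&2
    return 127
  fi
  echo "run  $name: $*"
  "$@" >"$LOG_DIR/$name.log" 2>&1
  rc=$?
  echo "done $name: exit status $rc, log in $LOG_DIR/$name.log"
  return "$rc"
}

# Example: the Figure 4 command from above, with its output captured in logs/figure4.log.
run_logged figure4 variance-vs-points-in-range --number-points-to-test 1000 \
  --data-generating-function-name "sinusoidal" --num-features-per-model 200 \
  --num-training-samples 6 --random-seed 42 \
  || echo "figure4 did not complete" >&2
```

The other figure commands can be wrapped the same way by changing the name and the command line.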


## Code structure

The code is organized as follows:

- `data_generation/`:
  - `data_generation.py`: Contains the functions to generate the data.
- `experiments/`:
  - `experiments.py`: Contains the base class for the experiments.
  - `average_difference_vs_num_features.py`: Code to run the experiment for Figure 3.
  - `convergence_expected_value_term.py`: Code to run the experiment for Figure 2.
  - `generalization_error_decay.py`: Code to run an experiment for Figure 5.
  - `lipschitz_difference_infinite_models.py`: Code to run an experiment for Figure 6.
  - `variance_vs_number_of_features.py`: Code to run the experiment for Figure 5.
  - `variance_vs_points_in_range.py`: Code to run the experiment for Figure 4.
  - `visualize_models.py`: Code to run the experiment for Figure 1.
- `matrices_and_kernels/`:
  - `kernel_calculations.py`: Contains the functions to calculate the kernel matrices.
  - `matrix_calculations.py`: Contains a function to calculate a Cholesky decomposition.
- `models/`:
  - `ensembles.py`: Contains the ensemble models.
  - `kernel_models.py`: Contains the kernel models.
  - `model_utils.py`: Contains the functions to apply activation functions etc.
  - `random_feature_models.py`: Contains the random feature models. 
- `monte_carlo/`:
  - `monte_carlo.py`: Contains functionality to run a Monte Carlo estimation of an expected value (used in Figure 2).
  - `w_terms.py`: Contains functionality to calculate $W$ and $w_\perp$ (used in Figure 2).
- `utils/`:
  - `constants.py`: Contains constants used in the experiments.
  - `utils.py`: Contains general utility functions for the experiments.
- `visualizations/`:
  - `plots.py`: Contains the functions to plot the results of the experiments.
  - `data_visualization.py`: Contains the functions to visualize the data and the model predictions.

## Additional notes

The code is not designed to run on GPUs; a standard CPU is sufficient.


