# Aggregate Models, Not Explanations: Improving Feature Importance Estimation

Code for reproducing the experiments from the paper "Aggregate Models, Not Explanations: 
Improving Feature Importance Estimation". 

# Requirements

The package and its dependencies can be installed with pip, in editable mode, from the repository root:
``` bash
pip install -e .
```


# Usage

## Python scripts

The experiments are reproduced by running two scripts:
1. `ensemble_vim/simulation.py` estimates variable importance on the benchmark datasets.
2. `ensemble_vim/asymptotic.py` computes an asymptotic estimate of feature importance, used as 
    ground truth in the simulations.

Both scripts accept command-line arguments to customize the experiments, for example with `simulation.py`:


``` bash
# Options are collected in a bash array so that each one can carry a comment
# (a comment placed after a trailing backslash would break the line continuation).
args=(
    --n_samples 100 500 1000   # (Required) List of sample sizes to simulate (e.g., 100, 500, 1000)
    --seed 1                   # Random seed for reproducibility
    --n_jobs 4                 # Number of parallel CPU jobs to run
    --n_splits 5               # Number of Cross-Validation (CV) splits
    --snr 1                    # Signal-to-Noise Ratio (SNR) for the generated dataset
    --n_ensemble 10            # Number of individual models in the ensemble
    --results_dir ./results    # Path to the directory where results will be saved
    --dataset_name friedman1   # Name of the dataset (e.g., friedman1, ishigami, g_function)
    --n_features 20            # Total number of features (variables) in the dataset
    --model_name mlp           # Type of model to use: 'mlp' (Multi-Layer Perceptron) or 'rf' (Random Forest)
    --ensemble bagging         # Ensemble strategy: 'voting' (different initializations) or 'bagging' (bootstrap samples)
    --sage                     # (Optional) Flag to compute SAGE variable importance values
)
python ensemble_vim/simulation.py "${args[@]}"
```

Computing SAGE values can be time-consuming; omit the `--sage` flag to skip this step.

## Cluster execution

The scripts `ensemble_vim/run_simulation.slurm` and `ensemble_vim/run_asymptotic.slurm` can be used to submit the 
experiments to a SLURM cluster. They use job arrays to parallelize the experiments over
different random seeds.
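
To illustrate the idea of one seed per array task, here is a minimal Python sketch. It is not the content of the repository's `.slurm` files: `build_command` is a hypothetical helper, and using the array index directly as the seed is an assumption. The only SLURM-specific element is the `SLURM_ARRAY_TASK_ID` environment variable, which SLURM sets for each task of a job array.

```python
import os
import sys


def build_command(task_id, n_samples=(100, 500, 1000), results_dir="./results"):
    """Build the simulation command for one array task (hypothetical helper).

    Assumption: the array task index is used directly as the random seed.
    """
    return [
        sys.executable,
        "ensemble_vim/simulation.py",
        "--n_samples", *map(str, n_samples),
        "--seed", str(task_id),
        "--results_dir", results_dir,
    ]


if __name__ == "__main__":
    # Outside of a SLURM job array the variable is unset, so fall back to 0.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(" ".join(build_command(task_id)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run instead of printing it.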


# Results
Results are saved in the directory given by `--results_dir`, organized as follows:
```
results/
    ├── <dataset_name>_<model_name>_n<n_samples>_p<n_features>_<ensemble><n_ensemble>/
    │   ├── models/
    │   ├── scores_<dataset_name>_<seed>.csv
    │   ├── support_<dataset_name>_<seed>.npy
    │   ├── cfi_<dataset_name>_<seed>.csv
    │   ├── sage_<dataset_name>_<seed>.csv
    │   ├── loco_<dataset_name>_<seed>.csv
    │   └── ...
    └── ...
```

 - `models/`: Directory containing `.pkl` files of the trained models, one for each seed 
 and each CV split.
 - `scores_<dataset_name>_<seed>.csv`: CSV file containing predictive performance scores 
 (MSE, R2) for each model and CV split.
 - `support_<dataset_name>_<seed>.npy`: NumPy file containing the true support (features 
 used in the data-generating process) for the dataset and seed.
 - `<method_name>_<dataset_name>_<seed>.csv`: CSV file containing variable importance 
 values estimated with either the *ensemble* or *sub-models* strategy. There is one file
 for each variable importance method, where `<method_name>` can be `cfi`, `sage`, or 
 `loco`.
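
The naming scheme above can be turned into paths with a few lines of Python. This is a minimal sketch, assuming one directory per sample size; `run_dir` and `result_files` are hypothetical helpers written for illustration and are not part of the repository.

```python
from pathlib import Path


def run_dir(results_dir, dataset_name, model_name, n_samples, n_features, ensemble, n_ensemble):
    """Directory of one configuration, following the naming pattern above."""
    name = f"{dataset_name}_{model_name}_n{n_samples}_p{n_features}_{ensemble}{n_ensemble}"
    return Path(results_dir) / name


def result_files(run, dataset_name, seed, methods=("cfi", "sage", "loco")):
    """Map each output kind to its expected path for a given seed."""
    files = {
        "scores": run / f"scores_{dataset_name}_{seed}.csv",
        "support": run / f"support_{dataset_name}_{seed}.npy",
    }
    for method in methods:
        files[method] = run / f"{method}_{dataset_name}_{seed}.csv"
    return files


# Example values taken from the command shown in the Usage section.
run = run_dir("./results", "friedman1", "mlp", 100, 20, "bagging", 10)
files = result_files(run, "friedman1", seed=1)
for kind, path in files.items():
    print(f"{kind:8s} {path}")
```

The CSV files can then be read with `pandas.read_csv` and the support array with `numpy.load`. The `sage` entry presumably exists only for runs launched with `--sage`, so check `path.exists()` before reading it.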

# Figures

The figures from the paper can be generated by running the scripts in `ensemble_vim/figures/`. 
The module `ensemble_vim/figures/utils.py` contains helper functions for reading the results and
computing aggregated metrics. Each figure script (e.g., `figure_2.py`) generates
the corresponding figure. Only the path to the results directory needs to be modified, at
the beginning of each figure script 
(`results_dir = Path("/path/to/your/results/directory")`). 
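
Before generating figures, it can help to check which configurations are present under `results_dir`. The sketch below is independent of the helpers in `ensemble_vim/figures/utils.py`, whose interface is not described here: `parse_run_dir` and `list_runs` are hypothetical names, and the regular expression assumes the directory naming pattern from the Results section, with a model name that contains no underscore.

```python
import re
from pathlib import Path

# <dataset_name>_<model_name>_n<n_samples>_p<n_features>_<ensemble><n_ensemble>
RUN_DIR_PATTERN = re.compile(
    r"^(?P<dataset_name>.+)_(?P<model_name>[^_]+)"
    r"_n(?P<n_samples>\d+)_p(?P<n_features>\d+)"
    r"_(?P<ensemble>[A-Za-z]+)(?P<n_ensemble>\d+)$"
)


def parse_run_dir(name):
    """Split a run directory name into its fields, or return None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("n_samples", "n_features", "n_ensemble"):
        fields[key] = int(fields[key])
    return fields


def list_runs(results_dir):
    """Return (path, fields) pairs for every matching subdirectory of results_dir."""
    runs = []
    for path in sorted(Path(results_dir).iterdir()):
        fields = parse_run_dir(path.name) if path.is_dir() else None
        if fields is not None:
            runs.append((path, fields))
    return runs
```

For example, `list_runs(results_dir)` can be filtered on `fields["dataset_name"]` or `fields["n_samples"]` to confirm that every configuration needed for a figure has been computed.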

# UKBB Experiment

The code for reproducing the UKBB experiment is in the `ensemble_vim/script_ukbb.py` 
script. It uses functions from the `ensemble_vim/utils.py` module. 