# Experiments for the KAURI and DOUGLAS paper

This README describes how to rerun each experiment described in our original work.

As always, everything starts with setting up your environment using the provided requirements:

```
pip install -r requirements.txt
```

Then, pick the sub-section matching the experiment you want to reproduce.

All experiments are divided into two parts:
+ One part that runs the scripts to obtain the results
+ One part that summarises the results and produces figures from them

Most of the figures are generated using ggplot2 in R, so you may also need to set up an R environment with the following packages:
+ ggplot2
+ tidyr
+ readr
+ dplyr
+ optparse

To download the required datasets, run:

```
cd data/datasets && ./download_datasets.sh
```

## Model selection

We provide a [Snakefile](https://snakemake.github.io/) at the root of the project to properly orchestrate all runs for the model selection. All you have to do is run:

```
snakemake all --cores all
```

inside the `model_selection` folder.

**Note**: you can of course pass a smaller number of cores instead of `all`, e.g. `--cores 4`.

Once the runs finish, a folder named `Predictions` will be produced, containing predictions for several datasets. Combining the results into a single figure can be done using the script `notebooks/generate_model_selection_fig.R`. All figures can be obtained using the R notebook `generate_figures_R.ipynb`.

## Benchmark runs

The benchmark can be run using the script `scripts/benchmark_main.py`, with the following usage:

> usage: benchmark_main.py [-h] {kauri,douglas,imm,exshallow,exkmc,ktree,rdm} ...
>
> optional arguments:
>   -h, --help            show this help message and exit
> 
> method:
>   {kauri,douglas,imm,exshallow,exkmc,ktree,rdm}

Each method, e.g. `kauri` or `douglas`, then takes its own parameters. The arguments common to most experiments appear in the following loop, which repeats a run 30 times:

```
for i in $(seq 1 30); do echo $i; python scripts/benchmark_main.py %METHOD% --dataset %DATASET% --n_clusters %K% --path_to_data data/datasets --subset_size 0.8 --output_file benchmark_results/%METHOD%/%DATASET%_run_${i}.csv; done
```

Replace the tag %DATASET% with one of `avila`, `breast_cancer`, `car_evaluation`, `congressional_votes`, `digits`, `haberman_survival`, `iris`, `mice_protein`, `poker_hand`, `vowel`, `wine`; %K% with the appropriate number of clusters; and %METHOD% with one of the methods mentioned in the help above.
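
For readers who prefer to drive the runs from Python rather than from a shell one-liner, the loop above can be sketched as follows. This is only a minimal sketch and not part of the repository: the helper names `build_command` and `run_benchmark` are hypothetical, the choice of 3 clusters for `iris` is purely illustrative, and creating the output folder beforehand is a precaution in case the script does not do it itself. Only the flags already shown in the loop above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(method, dataset, n_clusters, run, subset_size=0.8,
                  data_dir="data/datasets", results_dir="benchmark_results"):
    """Return (argv, output_path) for one run, mirroring the shell loop."""
    output_file = Path(results_dir) / method / f"{dataset}_run_{run}.csv"
    argv = [
        sys.executable, "scripts/benchmark_main.py", method,
        "--dataset", dataset,
        "--n_clusters", str(n_clusters),
        "--path_to_data", data_dir,
        "--subset_size", str(subset_size),
        "--output_file", output_file.as_posix(),
    ]
    return argv, output_file


def run_benchmark(method, dataset, n_clusters, n_runs=30, dry_run=True):
    """Print (dry_run=True) or execute the n_runs repetitions of one setting."""
    for run in range(1, n_runs + 1):
        argv, output_file = build_command(method, dataset, n_clusters, run)
        if dry_run:
            print(" ".join(argv))
            continue
        # Assumption: the script may not create the output folder itself.
        output_file.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Illustrative values only: pick the method, dataset and K you need.
    run_benchmark("kauri", "iris", 3, dry_run=True)
```

With `dry_run=True` the commands are only printed, which makes it easy to check them before launching the 30 runs from the root of the project with `dry_run=False`.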

To obtain all possible results of the benchmark, you can go directly into the `benchmark` folder and run the Snakefile target using:

```
snakemake all --cores all
```
The analysis of the results is then contained in the matching Jupyter notebook `Benchmark_results.ipynb`.
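
Before opening the notebook, it can be useful to check that no run is missing. The sketch below assumes the results follow the layout used by the manual loop earlier in this section, i.e. `benchmark_results/<method>/<dataset>_run_<i>.csv` with 30 runs; the Snakefile may write its outputs elsewhere, in which case the paths need adapting. The function names `expected_paths` and `missing_runs` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def expected_paths(method, dataset, n_runs=30, results_dir="benchmark_results"):
    """Paths of the per-run CSV files, assuming the layout of the manual loop."""
    folder = Path(results_dir) / method
    return [folder / f"{dataset}_run_{run}.csv" for run in range(1, n_runs + 1)]


def missing_runs(method, dataset, n_runs=30, results_dir="benchmark_results"):
    """Return the expected CSV files that are absent or empty."""
    return [
        path
        for path in expected_paths(method, dataset, n_runs, results_dir)
        if not path.is_file() or path.stat().st_size == 0
    ]


if __name__ == "__main__":
    # Illustrative values only: check one method/dataset pair.
    for path in missing_runs("kauri", "iris"):
        print(f"missing or empty: {path}")
```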

## Impact of the leaves on the WAES scores

To run this experiment, simply go to the folder `leaf_limit` and run the Snakefile using:

```
snakemake all --cores all
```

Once again, the figures for the paper were generated using the notebooks in the same folder.
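
Since the model selection, benchmark and leaf experiments all rely on the same `snakemake all --cores ...` command run inside their own folder, they can be chained from a small Python driver. This is a hedged sketch under the assumption that it is launched from the root of the project and that the three folders are named as in this README; `snakemake_argv` and `run_workflows` are hypothetical helpers.

```python
import subprocess

# Folders named in this README, each containing a Snakefile.
WORKFLOW_DIRS = ("model_selection", "benchmark", "leaf_limit")


def snakemake_argv(cores="all"):
    """Build the snakemake command used throughout this README."""
    return ["snakemake", "all", "--cores", str(cores)]


def run_workflows(folders=WORKFLOW_DIRS, cores="all", execute=False):
    """Print each command and, if execute=True, run it inside its folder."""
    for folder in folders:
        argv = snakemake_argv(cores)
        print(f"[{folder}] {' '.join(argv)}")
        if execute:
            # check=True stops the chain as soon as one workflow fails.
            subprocess.run(argv, cwd=folder, check=True)


if __name__ == "__main__":
    # Only prints the commands; pass execute=True to actually run them.
    run_workflows(cores=4)
```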

## Synthetic dataset with rotation of angles

This experiment is carried out with the notebook `synthetic_data_example/Computing WAD and WAES.ipynb`. The notebook produces CSV files containing the WAD/WAES scores as a function of the dataset parameters.

## Explainable "Votes" tree

Just open and run the notebook `A qualitative visualisation of the US votes dataset.ipynb` in the folder `qualitative_example`.

