This document explains how to install the library and launch the experiments.

# Installation of the C++ backend
This code relies on a fork of Gudhi that supports our multi-parameter persistence computations.
It has a few dependencies, which can be installed with, e.g., conda:
```sh
conda create -n python311
conda activate python311
conda install python=3.11 boost tbb tbb-devel numpy matplotlib gudhi scikit-learn cython sympy tqdm cycler typing shapely -c conda-forge
pip install filtration-domination
```
The fork can then be installed by running the following in a terminal, from the `multipers` directory:
```sh
pip install .
```
You can also take a look at the README in this folder for a basic overview of the functions in this library.
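
After installation, it can be useful to check that the Python dependencies are visible from the active environment. The following is a minimal sketch using only the standard library; the helper `missing_modules` is hypothetical, and `multipers` is an assumed import name, taken from the directory name above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Import names of the Python packages installed above (scikit-learn is
# imported as `sklearn`, cython as `Cython`). `multipers` is an assumption.
REQUIRED = ["numpy", "matplotlib", "gudhi", "sklearn", "Cython",
            "sympy", "tqdm", "shapely", "multipers"]

missing = missing_modules(REQUIRED)
print("all modules found" if not missing else f"missing: {missing}")
```

This only checks that each module can be located; it does not check that the compiled backend loads correctly.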

# Running the experiments
## Datasets
All datasets are expected to be located at `$HOME/Datasets/<type_of_dataset>/<name_of_dataset>/`, where `<type_of_dataset>` can be, e.g., "UCR", "Cleves-Jain" or "graphs". For instance:
 - "$HOME/Datasets/graphs/BZR/"
 - "$HOME/Datasets/UCR/Coffee/"
 - "$HOME/Datasets/Cleves-Jain/"
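
The layout above can be checked before launching an experiment. This is a small sketch, not part of the library: `dataset_dir` is a hypothetical helper that only rebuilds the paths listed above and reports whether they exist.

```python
from pathlib import Path
from typing import Optional


def dataset_dir(kind: str, name: Optional[str] = None,
                root: Optional[Path] = None) -> Path:
    """Return the directory where a dataset is expected to live."""
    # The default root is $HOME/Datasets, as described above.
    base = Path.home() / "Datasets" if root is None else Path(root)
    path = base / kind
    # Some datasets (e.g. Cleves-Jain) have no per-dataset subdirectory.
    return path / name if name else path


for kind, name in [("graphs", "BZR"), ("UCR", "Coffee"), ("Cleves-Jain", None)]:
    d = dataset_dir(kind, name)
    print(d, "found" if d.is_dir() else "missing")
```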

Links for datasets:
 - Graph datasets can be obtained at https://networkrepository.com/
 - The UCR time series datasets can be obtained at https://www.cs.ucr.edu/~eamonn/time_series_data/ or at https://www.timeseriesclassification.com/dataset.php
 - The Cleves-Jain dataset can be obtained at https://www.jainlab.org/Public/SF-Test-Data-DrugSpace-2006.zip

## Timings
The timing results are produced by the `timings.ipynb` Python notebook.

## Cleves-Jain dataset
The Cleves-Jain dataset is small, and can be handled in Python. The code is located in the notebook `JC.ipynb`.

## Other experiments
The other experiments are all run through the `compute.py` script, which takes care of
 - fetching the datasets,
 - formatting the datasets,
 - turning datasets into multi-parameter or single-parameter simplex trees (which encode the filtrations), depending on the pipeline,
 - computing vectorizations or kernels (1-parameter or multi-parameter), e.g., sliced Wasserstein, persistence images, landscapes, and our signed measure decomposition vectorization and kernel,
 - computing the classification.

Usage help is printed by `python compute.py --help`. We also provide two examples below, each of which should run in a few minutes at most.

### An example of computation: UCR
For instance, computing a classification of the time series dataset "UCR/ECG200", using a Rips+density (**rd**) 2-filtration, with the sliced Wasserstein kernel of our signed measure (**smk**) built from the Hilbert function in degrees 0 and 1, can be achieved by:
```sh
python compute.py --pipeline rd_smk --dataset UCR/ECG200 --degrees 0 --degrees 1
```
Several parameters can then be changed to improve the classification results, e.g.,
 - the resolution of the grid on which the signed measure is computed (`--resolution 1000`),
 - the rescaling between the filtrations (`--num_rescales 3`),
 - the number of cross-validation folds (`--train_k 10`),
 - the weights between degree 0 homology and degree 1 homology,
 - the quantile of the filtration grid to drop (`--drop_quantile 0.01`),
 - the strategy used to infer the filtration grid (`--infer_strategy regular`).

Some documentation about other trade-offs can be found using `python compute.py --help`.
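
When trying several values of these options, the command line can be assembled programmatically. The sketch below is only an illustration: `compute_command` is a hypothetical helper, and it uses only flags that appear in this document. A list value repeats the flag, in the way `--degrees 0 --degrees 1` is written above.

```python
import shlex
import sys


def compute_command(pipeline, dataset, **options):
    """Build the argument list for one run of compute.py."""
    cmd = [sys.executable, "compute.py",
           "--pipeline", pipeline, "--dataset", dataset]
    for flag, value in options.items():
        # A list or tuple repeats the flag once per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            cmd += [f"--{flag}", str(v)]
    return cmd


# Print one command per grid resolution, without running anything.
for resolution in (100, 500, 1000):
    cmd = compute_command("rd_smk", "UCR/ECG200",
                          degrees=[0, 1], resolution=resolution, train_k=10)
    print(shlex.join(cmd))
```

Each list returned by `compute_command` can be passed to `subprocess.run(cmd, check=True)` to actually launch the run.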


The following command should run in less than 5 minutes (on a computer equivalent to a 2020 MacBook Air):
```sh
python compute.py --pipeline rd_smk --dataset UCR/ECG200 --resolution 1000 --degrees 0 --degrees 1 --train_k 10 --infer_strategy regular --drop_quantile 0.01 --num_rescales 3
```

Its output should end with:
```
Computing score...Best classification parameters :  {'DM2K__axis': 23, 'DM2K__sigma': 10, 'DM2K__weights': array([10. ,  0.1]), 'SVMP__C': 10, 'SVMP__kernel': 'precomputed'}
Done.
Accuracy UCR/ECG200 : 0.87 
```
Here, `'DM2K__axis': 23` indicates that the cross-validation chose a bandwidth of 20% of the diameter of the dataset (for the density estimation of Rips+density), and rescaled the Rips filtration by 0.5 and the density filtration by 1.
This can be checked in the log of the program, by looking at the 23rd entry of the line starting with "New axes".

Similarly, `'DM2K__weights': array([10. ,  0.1])` indicates that the degree 0 homology received a weight of `10` and the degree 1 homology received a weight of `0.1`.
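
To collect results over several runs, the accuracy and the selected axis can be extracted from the printed output. This is a minimal sketch that assumes the output has exactly the format shown above; `parse_accuracy` and `parse_axis` are hypothetical helpers, not part of `compute.py`.

```python
import re

# Output copied from the example run above.
LOG = """Computing score...Best classification parameters :  {'DM2K__axis': 23, 'DM2K__sigma': 10, 'DM2K__weights': array([10. ,  0.1]), 'SVMP__C': 10, 'SVMP__kernel': 'precomputed'}
Done.
Accuracy UCR/ECG200 : 0.87
"""


def parse_accuracy(log):
    """Return (dataset, accuracy) from the 'Accuracy <dataset> : <value>' line, or None."""
    m = re.search(r"^Accuracy\s+(\S+)\s*:\s*([0-9.]+)", log, flags=re.MULTILINE)
    return (m.group(1), float(m.group(2))) if m else None


def parse_axis(log):
    """Return the selected 'DM2K__axis' index as an int, or None if absent."""
    m = re.search(r"'DM2K__axis':\s*(\d+)", log)
    return int(m.group(1)) if m else None


print(parse_accuracy(LOG))  # ('UCR/ECG200', 0.87)
print(parse_axis(LOG))      # 23
```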

### An example of computation: graphs
The same strategy can be applied to graphs.
Here we compute the signed measure convolution (`multi_smi`: `multi` replaces the previous `rd`, and `smi` stands for this vectorization) on the graph multi-filtrations given by
 - the heat kernel signature at time 10 (hks_10) 
 - the closeness centrality (cc)
 - and the degree (degree).

The following computation should give good results for the signed measure convolution on the MUTAG dataset:
```sh
python compute.py --pipeline multi_smi --dataset graphs/MUTAG --filtrations hks_10 --filtrations cc --filtrations degree --train_k 5 --test_k 5 --infer_strategy exact --resolution 20
```
As before, adding a few parameters to the cross-validation should improve the performance.


### One-parameter pipelines
Note that the signed measure kernel (**smk**) requires measures of zero total mass, i.e., all cycles need to die. We therefore recommend computing *extended persistence* on graphs, and thresholding diagrams on point clouds.
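
The zero-mass condition can be illustrated on a toy barcode. The sketch below assumes one reading of that condition: each bar contributes +1 at its birth and -1 at its death, so a bar that never dies leaves an unbalanced +1. It is a plain-Python illustration, not the library's implementation, and the threshold value `2.0` is arbitrary.

```python
import math


def signed_measure(bars):
    """Signed point measure of a barcode: +1 at each birth, -1 at each finite death."""
    measure = {}
    for birth, death in bars:
        measure[birth] = measure.get(birth, 0) + 1
        if math.isfinite(death):
            measure[death] = measure.get(death, 0) - 1
    return measure


def threshold(bars, t):
    """Cap every death value at t, so that infinite bars die at t."""
    return [(b, min(d, t)) for b, d in bars]


bars = [(0.0, 1.5), (0.2, math.inf)]
# The infinite bar leaves a total mass of 1.
print(sum(signed_measure(bars).values()))
# After thresholding, every bar dies and the total mass is 0.
print(sum(signed_measure(threshold(bars, 2.0)).values()))
```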

For example, our signed measure sliced Wasserstein kernel can be computed on a time series dataset, with thresholded diagrams, using
```sh
python compute.py --pipeline smk --dataset UCR/ECG200 --diagram_threshold -1
```
The usual sliced Wasserstein kernel on graph data, with the heat kernel signature filtration and extended persistence, can be computed using
```sh
python compute.py --pipeline sw --dataset graphs/BZR --extended -1 --filtration hks_10
```
