# Score matching through the roof

This is the code accompanying the paper "Score matching through the roof: linear, nonlinear, and latent variables causal
discovery.
"

## Installation

To run our code, install requirements via

    pip install -r requirements.txt


## Usage

### Stand-alone AdaScore

To use our AdaScore algorithm just import the `AdaScore` class. Example usage could be

    import pandas as pd
    from causal_discovery.adascore import AdaScore 
    df = pd.read_csv('data.csv')
    algo = AdaScore(alpha_orientation=.05, alpha_confounded_leaf=.05, alpha_separations=.05)
    graph = algo.fit(df)

The result is a `networkx.DiGraph()`.

### Running Experiments from the paper

To run the experiments shown in the paper use e.g. the following commands

    python3 benchmark/bench_increasing_size.py --algorithms ridge nogam --num_nodes 3 5 7 --p_edge .3
    python3 benchmark/bench_increasing_samples.py --algorithms ridge nogam --num_samples 300 500 --p_edge .3
    python3 benchmark/bench_real_data.py --dataset sachs

Note, that here we indicate AdaScore with different regression functions by the name of the regression method (e.g. `ridge` or `xgboost`).
These scripts allow to run multiple algorithms at once.
Please see the respective files for available algorithms and their command line parameters.
Further, the files allow to specify certain hyperparameters of the algorithms.
Most notably, the alpha-threshold for the AdaScore algorithm.
Since most of our baselines also have an alpha parameter, the script offers to pass `alpha_others` for all algorithms
except AdaScore.

Of course, the script also offers control over the generated data.
E.g. with `num_hidden` the number of randomly selected hidden variables can be controlled and with `p_edge` the
probability of adding an edge to the Erdos-Renyi ground truth graph can be controlled.

The results are stored in the dir `logs/paper-plots` with a timestamp.
The parameters of each run are found in the `params.json` file and in the folders `data` and `graphs` are the generated
data matrices, the ground truth graph and all estimated graphs.

Single plots can be generated via

    python3 plots/plot_metrics_incr_size.py
    python3 plots/plot_metrics_incr_samples.py
    python3 plots/plot_time_incr_samples.py
    python3 plots/plot_time_incr_size.py

where the parameters allow to controll which algorithms are included in the plots and which metric should be plotted.
The parameter `--newest i` plots the `i`-th newest log folder.

To reproduce the exact plots of the paper move the respective log folder from above into a specific folder  `logs/paper-plots/incr_size/` or  `logs/paper-plots/incr_samples/` make sure this folder contains only one subfolder per parameter
setting and then use

    python3 plots/paper-plots.py

where the command line parameters of the file control which experimental setting is plotted.
The plots from Figure 2 can be generated via

    python3 plots/plot_metrics_bnlearn.py 

without manually moving folders.

## Remarks on the code structure

Overall the project is structured as follows:
The `causal_discovery` folder contains our algorithm, including an oracle-version that replaces finite-sample estimates
with graphical calculations on the ground truth graph.
The `AdaScore` class can be used as stand-alone implementation.

The `benchmark` folder contains everything required to run the benchmark, including the code that calls the `causally`
data generation, the metrics that we use and the benchmarking scripts themselves.
It is worth noting that we added a `utils/algo_wrapper.py` file to unify the interface of algorithms from different
packages.
The classes in `utils/causal_graphs.py` allow us to dynamically call the correct metrics (for different graph types)
without `if-else` clauses in the main scripts.

`plots` contains the respective scripts to plot the data and `tests` contains unit tests for our code.



