# RE: GNNBoundary
# Repository structure

The repository is structured as follows:
```
.
├── ckpts
│   ├── motif.pth
│   ├── enzymes.pth
│   ├── idmb.pth
│   ├── reddit.pth
│   └── collab.pth
├── data
│   ├── COLLAB
│   ├── ENZYMES
│   ├── IMDB
│   ├── Motif
│   ├── REDDIT-MULTI-5K
│   └── RedditDataset
├── experiments
│   ├── boundary_complexity
│   ├── boundary_margin
│   ├── boundary_thickness
│   ├── complexity_ranges
│   ├── figure3
│   ├── margin_ranges
│   ├── table1
│   ├── table2
│   └── thickness_ranges
├── graphs
│   ├── boundary
│   └── interpreter
├── gnnboundary
│   ├── criteria # Criteria for graph generation
│   ├── datasets # Dataset handling
│   ├── models # Model implementations
│   ├── tuning
│   │   ├── boundary_graph_generation.py # Boundary graph generation helper class for tuning experiments
│   │   └── tuning.py # Hyperparameter tuning for boundary graph generation
│   ├── utils
│   │   ├── boundary_analysis.py # Boundary analysis
│   │   ├── boundary_generator.py # Boundary graph generation
│   │   ├── boundary_evaluator.py # Implementation of all boundary evaluation metrics: complexity, margin, thickness
│   │   └── random_baseline.py # Random baseline for boundary graph generation
│   └── visualization
├── scripts
│   ├── adjacency.py # Script to analyze adjacency
│   ├── default_configs.py # Important default configuration we use in all experiments
│   ├── experiments.py # Experiments are defined and implemented here
│   ├── run_experiments.py # Script to run experiments
│   ├── save_graphs.py # Script to save generated graphs
│   ├── sampling_training.py # Script to sample and train the graph sampler in an easy way
│   └── utils.py # Utility functions for experiments

```

# Starting
To start, install the required packages by running:
```sh
conda create -n gnnboundary poetry jupyter
conda activate gnnboundary
poetry install
```
This step might report an error. If it does, ignore it and continue with:
```sh
git clone https://github.com/yolandalalala/gnn-xai-common.git
cd gnn-xai-common
pip install -e .
cd ..
ipython kernel install --user --name=gnnboundary --display-name="GNNBoundary"
```

# Data
The datasets are downloaded and processed automatically when their class is instantiated for the first time. The exceptions are the Collab and Enzymes datasets, which must be downloaded manually from [here](https://drive.google.com/file/d/1O3IRF9mhL2KCCU1eVlCEdssaf6y-pq2h/view?usp=sharing). Put them in the `data` folder as shown in the repository structure above.
The only requirement is that every dataset directory contains a `processed` directory.
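
The `processed` requirement can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper name `missing_processed_dirs` is hypothetical, and it assumes only the `data/<DATASET>/processed` layout described above. Datasets that are downloaded automatically may not have a directory yet until their class has been instantiated once.

```python
from pathlib import Path


def missing_processed_dirs(data_root):
    """Return the names of dataset directories that lack a `processed` subdirectory."""
    root = Path(data_root)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / "processed").is_dir()
    )


if __name__ == "__main__":
    data_root = Path("data")  # the data folder from the repository structure
    if data_root.is_dir():
        missing = missing_processed_dirs(data_root)
        print("Missing `processed`:", missing or "none")
```

Run it from the repository root. An empty result means every dataset directory currently present contains a `processed` directory.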

# Experiments
Experiments in this repository refer to both reproduction and extension studies. They are implemented in `scripts/experiments.py` and can be executed using the following command:

<details>
<summary> Click to expand </summary>

```sh
python -m scripts.run_experiments \
    --experiment ExperimentName \
    --dataset DatasetName \
    --temperature 0.2
```

## Available Experiments
To run an experiment, specify the `--experiment` argument with one of the following options:

- `adjacency_graph_gen`
- `boundary_complexity`
- `table1`
- `graph_gen`
- `table2`
- `figure3`
- `complexity_ranges`
- `margin_ranges`
- `thickness_ranges`

## Supported Datasets
The experiments support the following datasets:

- `collab`
- `enzymes`
- `motif`
- `reddit`
- `imdb`

## Arguments
The script provides a variety of options for configuring experiments:

| Argument | Type | Default | Description                                                                        |
|----------|------|---------|------------------------------------------------------------------------------------|
| `--experiment` | str | Required | Experiment name (see options above).                                               |
| `--dataset` | str | Required | Dataset name (see supported datasets above).                                       |
| `--temperature` | float | 0.2 | Temperature for sampling.                                                          |
| `--num_graphs` | int | 10 | Number of graphs to generate.                                                      |
| `--num_runs` | int | None | Number of runs for `table2`.                                                       |
| `--ckpt_path` | str | None | Checkpoint path for loading a trained model.                                       |
| `--num_iterations` | int | 100 | Number of iterations.                                                              |
| `--strategy` | str | `cross_entropy` | Training strategy (`dynamic_boundary` or `cross_entropy`).                         |
| `--class_pair` | str | `0,1` | Class pair for experiments (comma-separated).                                      |
| `--graph_directory` | str | None | Directory to load saved graphs.                                                    |
| `--save_dir` | str | `./results` | Directory to save experiment results.                                              |
| `--lr` | float | 1.0 | Learning rate.                                                                     |
| `--adj_threshold` | float | 0.8 | Adjacency threshold for adjacency analysis.                                        |
| `--learn_node_feat` | bool | True | Whether to learn node features.                                                    |
| `--w_budget_init` | float | None | Initial budget weight.                                                             |
| `--w_budget_inc` | float | None | Budget increment value.                                                            |
| `--w_budget_dec` | float | None | Budget decrement value.                                                            |
| `--max_nodes` | int | 25 | Maximum nodes for graph sampling.                                                  |
| `--target_size` | int | 30 | Target size for graph generation.                                                  |
| `--random_id` | int | None | Random ID for experiment tracking.                                                 |
| `--ranges` | str | `0.45,0.55;0.47,0.53;0.48,0.52;0.49,0.51;0.495,0.505` | Probability ranges for boundary statistic in relation to target range experiments. |
| `--interpreter_directory` | str | None | Directory for boundary analysis interpreter.                                       |
| `--reference_class` | int | None | Reference class for boundary margin and thickness.                                 |

## Example Usage
Run the `boundary_complexity` experiment on the `enzymes` dataset with a temperature of `0.3`:

```sh
python -m scripts.run_experiments \
    --experiment boundary_complexity \
    --dataset enzymes \
    --temperature 0.3 \
    --num_graphs 20 \
    --lr 0.5
```

Results will be saved in `./experiments` by default.

## Notes
- Ensure that all required arguments are provided.
- The checkpoint path should be specified if using a pre-trained model.
- Some experiments require additional parameters (e.g., `num_runs` for `table2`).
- You can check the available arguments for each experiment by running `python -m scripts.run_experiments --help`.

For further details, refer to the `scripts/experiments.py` and `scripts/run_experiments.py` implementations.

## Experiments overview

The notebook `experiments.ipynb` provides an interactive overview of the experiments, and you can run them from there. To check reproducibility in a reasonable amount of time, we recommend setting parameters such as `num_graphs`, `num_iterations`, or `num_runs` to low values. To fully reproduce our results, set them back to the values we used.
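
As an illustration of such a reduced run, the sketch below builds the command line for `scripts.run_experiments` with small values. The helper `quick_check_command` is hypothetical, and the values 5 and 10 are arbitrary placeholders rather than recommended settings. Only flags listed in the arguments table are used.

```python
import shlex
import sys


def quick_check_command(experiment, dataset, num_graphs=5, num_iterations=10):
    """Build an argv list for a reduced-size run of scripts.run_experiments."""
    return [
        sys.executable, "-m", "scripts.run_experiments",
        "--experiment", experiment,
        "--dataset", dataset,
        "--num_graphs", str(num_graphs),          # small value: quick check only
        "--num_iterations", str(num_iterations),  # small value: quick check only
    ]


if __name__ == "__main__":
    cmd = quick_check_command("boundary_complexity", "enzymes")
    print(shlex.join(cmd))
    # To launch it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the command is started with `subprocess.run`.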

### Boundary Graph generation
Boundary graph generation lives in its own script (`scripts/save_graphs.py`) because it is the foundation for all our experiments. All generated graphs are included in this repository, so you can run the experiments without generating the graphs yourself.
The important parameters to set are:
- [Required] dataset: the dataset to use
- [Required] num_iterations: the number of iterations for training
- [Required] temperature: the temperature for sampling
- [Required] cls_pair: the pair of classes to generate the boundary graph for 
- [Required] num_graphs: the number of graphs to generate 
- [Required] num_runs: The number of runs to perform until the training is cancelled for that try
- max_nodes: the maximum number of nodes for graph generation
- target_size: the target size for graph generation
- target_probs: the target probabilities for convergence

For further information, refer to the `gnnboundary/utils/boundary_generator.py` and `scripts/save_graphs.py` implementation.

```sh
python -m scripts.save_graphs \
generate \
motif \
500 \
1000 \
--num_iterations=1000 \
--k_samples=32 \
--cls_pairs=0,1 \
--target_size=57 \
--lr=0.9 \
--temperature=0.41
```

### Interpreter Graph Generation
This comes from the same file as the boundary graph generation. The important parameters to set are:
- [Required] dataset: the dataset to use
- [Required] num_iterations: the number of iterations for training
- [Required] temperature: the temperature for sampling
- [Required] cls: the class to generate the interpreter graph for
- [Required] num_graphs: the number of graphs to generate
- [Required] num_runs: The number of runs to perform until the training is cancelled for that try
- max_nodes: the maximum number of nodes for graph generation
- target_size: the target size for graph generation
- target_probs: the target probabilities for convergence

```sh
python -m scripts.save_graphs \
interpreter \
collab \
500 \
1000 \
--num_iterations=500 \
--cls=1 \
--target_size=57 \
--lr=0.5 \
--temperature=0.41 \
--k_samples=32 \
--target_probs="0.9,1"
```

### Figure 1 / Adjacency analysis
Important parameters to set:
- dataset: the dataset to use
```sh
python -m scripts.run_experiments \
    --experiment adjacency \
    --dataset enzymes
``` 

### Boundary complexity
Important parameters to set:
- class_pair: the pair of classes to analyze
- dataset: the dataset to use
- temperature: the temperature for sampling
- num_graphs: the number of graphs to generate or to load

```sh
python -m scripts.run_experiments \
    --experiment boundary_complexity \
    --dataset enzymes \
    --temperature 0.2 \
    --class_pair 4,5 \
    --num_graphs 500
```

### Table 1
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] temperature: the temperature for sampling
- [Required] num_graphs: the number of graphs to generate or to load

```sh
python -m scripts.run_experiments \
    --experiment table1 \
    --dataset IMDB \
    --temperature 0.2 \
    --num_graphs 500
```

### Table 2
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] temperature: the temperature for sampling
- [Required] num_runs: the number of runs to perform
- [Required] target_size: the target size for graph generation
- [Required] strategy: the training strategy to use. Can be either dynamic_boundary or cross_entropy
- max_nodes: the maximum number of nodes for graph generation
- w_budget_init: the initial budget weight
- w_budget_inc: the budget increment value
- w_budget_dec: the budget decrement value
- learn_node_feat: whether to learn node features
- target_probs: the target probabilities for convergence
```sh
python -m scripts.run_experiments \
--experiment="table2" \
--dataset="motif" \
--strategy="dynamic_boundary" \
--num_iterations=50 \
--num_runs=5 \
--lr=0.8 \
--temperature=0.5 \
--target_size=60 \
--target_probs="0.45,0.55" 
```

### Figure 3
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] temperature: the temperature for sampling
- [Required] num_graphs: the number of graphs to generate or to load
- graph_directory: the directory to load saved graphs
- interpreter_directory: the directory for boundary analysis interpreter
```sh
python -m scripts.run_experiments \
    --experiment figure3 \
    --dataset motif \
    --temperature 0.2 \
    --num_graphs 500 \
    --graph_directory "./graphs/boundary/Motif" \
    --interpreter_directory "./graphs/interpreter/Motif"
```

### Relationship between boundary complexity and target range
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] class_pair: the pair of classes to analyze
- [Required] temperature: the temperature for sampling
- [Required] num_graphs: the number of graphs to generate or to load
- [Required] ranges: the probability ranges for boundary statistic in relation to target range experiments
- [Required] graph_directory: the directory to load saved graphs
```sh
python -m scripts.run_experiments \
    --experiment complexity_ranges \
    --dataset motif \
    --class_pair 0,1 \
    --temperature 0.2 \
    --num_graphs 500 \
    --ranges "0.45,0.55;0.47,0.53;0.48,0.52;0.49,0.51;0.495,0.505" \
    --graph_directory "./graphs/boundary/Motif/0-1"
```
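
The `--ranges` value packs several probability intervals into one string: intervals are separated by `;` and the two bounds of each interval by `,`. How the repository parses this string is not shown here. The sketch below only illustrates the format with a hypothetical helper `parse_ranges`, and the check that bounds lie in [0, 1] is an added assumption.

```python
def parse_ranges(spec):
    """Parse 'lo,hi;lo,hi;...' into a list of (lo, hi) float tuples."""
    ranges = []
    for chunk in spec.split(";"):
        lo_text, hi_text = chunk.split(",")
        lo, hi = float(lo_text), float(hi_text)
        # Assumption: the bounds are probabilities with lo <= hi.
        if not 0.0 <= lo <= hi <= 1.0:
            raise ValueError(f"invalid probability range: {chunk!r}")
        ranges.append((lo, hi))
    return ranges


if __name__ == "__main__":
    default = "0.45,0.55;0.47,0.53;0.48,0.52;0.49,0.51;0.495,0.505"
    for lo, hi in parse_ranges(default):
        print(f"[{lo}, {hi}]")
```

Each interval in the default value is centered on 0.5 and narrower than the previous one.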

### Relationship between boundary margin and target range
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] class_pair: the pair of classes to analyze
- [Required] temperature: the temperature for sampling
- [Required] num_graphs: the number of graphs to generate or to load
- [Required] ranges: the probability ranges for boundary statistic in relation to target range experiments
- [Required] graph_directory: the directory to load saved graphs
- [Required] interpreter_directory: the directory for boundary analysis interpreter
- [Required] reference_class: the reference class for boundary margin and thickness
```sh
python -m scripts.run_experiments \
    --experiment margin_ranges \
    --class_pair 0,1 \
    --dataset motif \
    --temperature 0.2 \
    --num_graphs 500 \
    --ranges "0.45,0.55;0.47,0.53;0.48,0.52;0.49,0.51;0.495,0.505" \
    --graph_directory "./graphs/boundary/Motif/0-1" \
    --interpreter_directory "./graphs/interpreter/Motif/0" \
    --reference_class 0
```
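
The example paths suggest a layout of `<root>/boundary/<Dataset>/<a>-<b>` for boundary graphs and `<root>/interpreter/<Dataset>/<class>` for interpreter graphs. This layout is inferred only from the commands in this README, so treat it as an assumption. A small sketch with a hypothetical helper `graph_directories` that derives both arguments from the class pair and reference class:

```python
def graph_directories(dataset_dir, class_pair, reference_class, root="./graphs"):
    """Return (graph_directory, interpreter_directory) for a class pair.

    Assumes the naming seen in the example commands; it is not read from the code.
    """
    a, b = class_pair
    boundary = f"{root}/boundary/{dataset_dir}/{a}-{b}"
    interpreter = f"{root}/interpreter/{dataset_dir}/{reference_class}"
    return boundary, interpreter


if __name__ == "__main__":
    graph_dir, interpreter_dir = graph_directories("Motif", (0, 1), reference_class=0)
    print("--graph_directory", graph_dir)
    print("--interpreter_directory", interpreter_dir)
```

Deriving both paths from one place keeps `--class_pair`, `--graph_directory`, `--interpreter_directory`, and `--reference_class` consistent when scripting runs over several class pairs.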

### Relationship between boundary thickness and target range
Important parameters to set:
- [Required] dataset: the dataset to use
- [Required] class_pair: the pair of classes to analyze
- [Required] temperature: the temperature for sampling
- [Required] num_graphs: the number of graphs to generate or to load
- [Required] ranges: the probability ranges for boundary statistic in relation to target range experiments
- [Required] graph_directory: the directory to load saved graphs
- [Required] interpreter_directory: the directory for boundary analysis interpreter
- [Required] reference_class: the reference class for boundary margin and thickness
```sh
python -m scripts.run_experiments \
    --experiment thickness_ranges \
    --class_pair 0,1 \
    --dataset motif \
    --temperature 0.2 \
    --num_graphs 500 \
    --ranges "0.45,0.55;0.47,0.53;0.48,0.52;0.49,0.51;0.495,0.505" \
    --graph_directory "./graphs/boundary/Motif/0-1" \
    --interpreter_directory "./graphs/interpreter/Motif/0" \
    --reference_class 0
```
### Random Baseline
To run the baseline experiment by sampling random graphs from the dataset, run:
```sh
python -m gnnboundary.utils.random_baseline --num_boundary_graphs 500 --class_samples Dataset
```
To do the same, but using class graphs derived from GNNInterpreter, run:
```sh
python -m gnnboundary.utils.random_baseline --num_boundary_graphs 500 --class_samples GNNInterpreter
```

</details>

# Hyperparameter tuning
We performed extensive hyperparameter tuning for the boundary graph generation. The results can be found in the `tuning` folder.
To reproduce the results, you need to do the following steps:
1. Uncomment the desired dataset in `boundary_graph_generation.py`
2. Uncomment the corresponding output file name at the top in `tuning.py`
3. Run the following command
```sh
python -m gnnboundary.tuning.tuning
```
To run starting from previous results, enter the path at the end of `tuning.py` and set the number of random starts to 0.


