# Rethinking Graph Structure Learning For Graph Neural Networks

## Overview
The repository is organised as follows:
```
| -- train.py # Main training framework
| -- config.py # Configurations and descriptions of hyper-parameters
| -- datasets.py # Dataset class and homophily metrics (label, structural, and feature homophily)
| -- preprocess_dataset.py # Preprocess real-world datasets
| -- utils.py # Some useful functions
| -- /data # Folder to place preprocessed datasets
| -- /experiments # Folder to save experimental results
```
## Environment

```
python==3.7.16
torch==1.12.0+cu116
torch-cluster==2.1.1
torch-geometric==2.3.1
torch-scatter==2.1.1
torch-sparse==2.1.1
torch-spline-conv==1.2.2
dgl==1.1.2+cu116
ogb==1.3.6
numpy==1.19.2
scipy==1.7.3
networkx==2.3
```

## Data Preparation
#### 1. Download datasets
The roman-empire, amazon-ratings, minesweeper, tolokers, questions, squirrel-filtered, chameleon-filtered, actor, texas-4-classes, cornell, and wisconsin datasets are from https://github.com/yandex-research/heterophilous-graphs/tree/main/data.

Please download these preprocessed datasets and place them in `/data`.

#### 2. Preprocessed datasets
The cora, pubmed, and citeseer datasets are downloaded and preprocessed through `preprocess_dataset.py`, by calling:
```
load_new_dataset(dataset_name=<DATASET_NAME>, split_type='random', train_prop=0.6, valid_prop=0.2, num_data_splits=10)
```

The preprocessed `<DATASET_NAME>.npz` file will be placed in `/data`.
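To illustrate what the split arguments above plausibly describe, here is a minimal NumPy sketch of repeated random splits. The `random_splits` helper and its `seed` argument are hypothetical and are not taken from `preprocess_dataset.py`; assigning the remaining nodes to the test set is also an assumption.

```python
import numpy as np

def random_splits(num_nodes, train_prop=0.6, valid_prop=0.2,
                  num_data_splits=10, seed=0):
    """Hypothetical helper: boolean masks of shape (num_data_splits, num_nodes)."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_prop * num_nodes))
    n_valid = int(round(valid_prop * num_nodes))
    train = np.zeros((num_data_splits, num_nodes), dtype=bool)
    valid = np.zeros_like(train)
    test = np.zeros_like(train)
    for i in range(num_data_splits):
        perm = rng.permutation(num_nodes)  # a fresh shuffle for every split
        train[i, perm[:n_train]] = True
        valid[i, perm[n_train:n_train + n_valid]] = True
        # Assumption: every node not used for training/validation is a test node.
        test[i, perm[n_train + n_valid:]] = True
    return train, valid, test

train_masks, valid_masks, test_masks = random_splits(num_nodes=100)
```

Each row of the returned arrays is one split, so row `i` can be used for the `i`-th training run.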

## Training

### GSL Setting

Set `rewrite_basis=do_not_rewrite` to run GNNs without graph structure learning (GSL); any other value of `rewrite_basis` enables GSL.

1. `rewrite_basis` selects the node representation on which the GSL graph is built. Options include:
   - `feature`: the original input features,
   - `agge_feature`: features after 1-hop neighborhood aggregation,
   - `grace_feature`: pretrained features from the unsupervised graph contrastive learning method GRACE,
   - `gcn_feature`: node embeddings pretrained by GCN,
   - `mlp_feature`: node embeddings pretrained by MLP.

2. `rewrite_construct` specifies the method for constructing new graphs in GSL, with the following choices:
   - `cos_sim_graph`: constructs new edges based on graph-level cosine similarity,
   - `cos_sim_node`: constructs new edges based on node-level cosine similarity,
   - `knn`: constructs new edges using the k-Nearest-Neighbors algorithm.

3. `rewrite_construct_param` sets the hyperparameter of the chosen construction method. With `cos_sim_*`, it controls the ratio of new edges to the edges of the original graph. With `knn`, it is the number of nearest neighbors (k).

4. `rewrite_fusion` determines the types of graph views in GSL, with options:
   - `only_old`: uses the original graph only,
   - `only_new`: uses the GSL graph only,
   - `both_share_param`: incorporates both graphs, sharing the same model parameters,
   - `both_seperate_param`: incorporates both graphs with separate model parameters.

5. `rewrite_fusion_state` defines the fusion stage for combining graph views, with the options:
   - `early`: early fusion,
   - `late`: late fusion.
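As a rough illustration of the `knn` construction option, the sketch below builds directed kNN edges from a feature matrix with NumPy. It is not the code in `train.py`: the `knn_edges` helper is hypothetical, and using cosine similarity as the neighbor measure is an assumption.

```python
import numpy as np

def knn_edges(features, k):
    """Hypothetical helper: return a (2, N * k) array of (source, target) edges."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero rows
    sim = unit @ unit.T                             # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                  # never pick a node as its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]     # k most similar nodes per row
    sources = np.repeat(np.arange(features.shape[0]), k)
    return np.stack([sources, neighbors.reshape(-1)])

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
edges = knn_edges(features, k=1)
```

Here the `features` argument stands in for whichever representation `rewrite_basis` selects, and `k` plays the role of `rewrite_construct_param`. The dense similarity matrix needs O(N²) memory, so a large graph would call for a sparse or batched variant.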

For details on other commonly used model hyperparameters, please refer to `config.py`.

For example, the command:
```
python train.py --dataset minesweeper --model GCN --rewrite_basis feature --rewrite_construct knn --rewrite_construct_param 5 --rewrite_fusion both_seperate_param --rewrite_fusion_state late
```
enables GSL. The GSL graph is constructed with the kNN algorithm on the input features, with k set to 5. Both the GSL graph and the original graph are fed into the GCN with separate parameters, and the two views are combined by late fusion. The results are:
```
Val ROC AUC mean: 0.8692
Val ROC AUC std: 0.0088
Test ROC AUC mean: 0.8717
Test ROC AUC std: 0.0059
```
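The idea behind `both_seperate_param` with `late` fusion in the command above can be sketched as follows. This is a simplified NumPy illustration and not the repository's model: the single-layer `gcn_layer`, the helper names, and averaging the two outputs are all assumptions.

```python
import numpy as np

def normalize_adj(adj):
    """Symmetrically normalise A + I, as in a standard GCN layer."""
    adj = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]

def gcn_layer(adj_norm, x, weight):
    """One linear GCN propagation step (no nonlinearity, for brevity)."""
    return adj_norm @ x @ weight

def late_fusion(adj_old, adj_new, x, w_old, w_new):
    """Run each graph view through its own weights, then combine the outputs."""
    out_old = gcn_layer(normalize_adj(adj_old), x, w_old)
    out_new = gcn_layer(normalize_adj(adj_new), x, w_new)
    return 0.5 * (out_old + out_new)  # assumption: fuse by simple averaging
```

Under the same assumptions, a shared-parameter variant would pass one weight matrix as both `w_old` and `w_new`.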

To run a GNN without GSL, use a command such as:
```
python train.py --dataset minesweeper --model GCN --rewrite_basis do_not_rewrite --rewrite_fusion only_old
```
This yields the following results:
```
Val ROC AUC mean: 0.8854
Val ROC AUC std: 0.0099
Test ROC AUC mean: 0.8874
Test ROC AUC std: 0.0060
```

## License
MIT