# GIT: Graph Generality Identify on Task-Trees

## Overview Structure

```
.
|- config
|- data_preprocessing
|- data
|- model
|- task
|- utils
|- pretrain.py
|- sft.py
|- finetune.py
```

The `config` folder contains the model hyper-parameters, the `data_preprocessing` folder contains the data
preprocessing functions, the `data` folder contains the dataset loaders used in the paper, the `model` folder contains
the model implementation, the `task` folder contains the task definitions, and the `utils` folder contains utility
functions. The `pretrain.py` file is used for pre-training the model, the `sft.py` file is used for source-free
transfer learning, and the `finetune.py` file is used for fine-tuning the model.

## Setup Environment

We use conda to set up the environment. Run the following commands:


```
conda env create -f environment.yml
conda activate GFM
```

## Dataset Preparation

Please use the `data_preprocessing` folder to download the datasets. The folder is structured as follows.

```
.
|- configs
   |- data_config.yaml
   |- task_config.yaml
   |- default_config.yaml
|- data
|- ...
|- ofa_data_process.py
```

Specify the dataset you want to process in the `default_config.yaml` file, under the `task_name` entry.

The following datasets are available.

- cora_node
- citeseer_node
- pubmed_node
- arxiv
- arxiv23
- dblp_node
- bookhis
- bookchild
- elecomp
- elephoto
- sportsfit
- amazonratings
- products
- FB15K237
- WN18RR
- codex_s
- codex_m
- codex_l
- ICEWS1819
- NELL995
- GDELT
- chemblpre
- chempcba
- chemhiv
- tox21
- toxcast
- muv
- cyp450
- bace
- bbbp
- enron
- googlemap_ct

PS: The code is from the [paper](https://arxiv.org/pdf/2406.10727). We thank the authors for their contribution.

## Pretrain

To pretrain the model, please run `pretrain.py` by specifying the experiment setting. Here is an example.

```
python pretrain.py --dataset default --fanout 10 --num_layers 2 --lr 1e-7 --edge_p 0.2 --feat_p 0.2 --align_reg_lambda 10
```

## Specialization (SFT)

To perform SFT, please run `sft.py` by specifying the experiment setting. Here is an example.

```
python sft.py --pt_data default --save --data arxiv --lr 1e-7 --pt_lr 1e-7 --pt_feat_p 0.2 --pt_edge_p 0.2 --pt_align_reg_lambda 10 --pt_epochs 10 --epochs 500
```

The `--data` option indicates the SFT data. In our paper, we use the arxiv, products, pcba, and FB15K237 datasets for SFT.
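
To run SFT on several datasets in turn, the command above can be generated programmatically. The following is a minimal sketch, not part of the repository: the helper names `build_sft_command` and `sft_commands` are hypothetical, the flags and values are copied from the example above, and it assumes that `sft.py` accepts its flags in any order. It only prints the commands rather than executing them.

```python
import shlex

# Flags and values copied from the SFT example above; adjust as needed.
BASE_SFT_ARGS = [
    "python", "sft.py",
    "--pt_data", "default",
    "--save",
    "--lr", "1e-7",
    "--pt_lr", "1e-7",
    "--pt_feat_p", "0.2",
    "--pt_edge_p", "0.2",
    "--pt_align_reg_lambda", "10",
    "--pt_epochs", "10",
    "--epochs", "500",
]

# Dataset names exactly as written in the sentence above; check them
# against the dataset list before running.
SFT_DATASETS = ["arxiv", "products", "pcba", "FB15K237"]


def build_sft_command(data):
    """Return the argument list for one SFT run on the given dataset."""
    return BASE_SFT_ARGS + ["--data", data]


def sft_commands(datasets=SFT_DATASETS):
    """Return one argument list per SFT dataset (hypothetical helper)."""
    return [build_sft_command(name) for name in datasets]


if __name__ == "__main__":
    # Dry run: print each command instead of launching it.
    for argv in sft_commands():
        print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` once the dataset names have been verified.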

## Fine-tuning

We consider three settings, including basic fine-tuning, in-context learning, and zero-shot learning.

Here is an example of basic fine-tuning.

```
python finetune.py --pt_data default --sft_data arxiv --use_params --setting base --dataset cora
```

You can use the `--pt_data` and `--sft_data` options to specify the pretrain and SFT data, respectively. Note that you
can omit the `--sft_data` option and use only the pretrained model for fine-tuning.

You can use the `--use_params` option to apply the default hyper-parameters.

Here is an example of in-context learning.

```
python finetune.py --pt_data default --sft_data arxiv --no_split --use_params --setting in_context --dataset cora
```

Here is an example of zero-shot learning.

```
python finetune.py --pt_data default --sft_data arxiv --no_split --use_params --setting zero_shot --dataset cora
```

Note that the `--no_split` option indicates that the testing and validation tasks do not need to be sampled from the
testing and validation sets, respectively. Instead, tasks are randomly sampled from the original set, i.e., from all
instances.
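
In the three examples above, the commands differ only in the `--setting` value and in whether `--no_split` is present. The following sketch makes that pattern explicit. It is an illustration rather than part of the repository: the helper `build_finetune_command` is hypothetical, and it assumes that `--no_split` is wanted exactly for the `in_context` and `zero_shot` settings, as in the examples.

```python
import shlex

# Settings named in this README. In the examples above, the in-context and
# zero-shot settings are the ones that pass --no_split (an assumption that
# this is always the intended pairing).
NO_SPLIT_SETTINGS = {"in_context", "zero_shot"}
SETTINGS = {"base"} | NO_SPLIT_SETTINGS


def build_finetune_command(setting, dataset, pt_data="default", sft_data=None):
    """Return the argument list for one finetune.py run (hypothetical helper)."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    argv = ["python", "finetune.py", "--pt_data", pt_data]
    # --sft_data is optional: omit it to use only the pretrained model.
    if sft_data is not None:
        argv += ["--sft_data", sft_data]
    if setting in NO_SPLIT_SETTINGS:
        argv.append("--no_split")
    argv += ["--use_params", "--setting", setting, "--dataset", dataset]
    return argv


if __name__ == "__main__":
    # Dry run: print the command for each setting on the cora dataset.
    for setting in ("base", "in_context", "zero_shot"):
        print(shlex.join(build_finetune_command(setting, "cora", sft_data="arxiv")))
```

Run as a script, this prints the same three commands shown in the examples above.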