# Goodreads_children

This repository includes the following example scripts:

+ **Document GNN**: a GNN that uses LLM embeddings with the GraphTransformer operator for **link prediction** and **edge classification**.

## Requirements

+ PyG >= 2.4
+ [info_nce](https://github.com/RElbers/info-nce-pytorch)

## Data Introduction

The `children_genre` dataset is extracted from [goodreads_children](https://mengtingwan.github.io/data/goodreads.html). The `children_genre` graph has two kinds of edges: user-review-book and book-description-genre.

## Key components in Link2doc

+ `document_embedding.py` generates documents from a textual-edge graph (TEG).
+ `classifier_link_prediction.py` performs link prediction.

## Set Up

### Get data

+ Retrieve data.

``` bash
wget https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_children.json.gz -O goodreads_reviews_children.json.gz
gzip -d goodreads_reviews_children.json.gz
wget https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/goodreads_book_genres_initial.json.gz -O goodreads_book_genres_initial.json.gz
gzip -d goodreads_book_genres_initial.json.gz
wget https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/goodreads_books_children.json.gz -O goodreads_books_children.json.gz
gzip -d goodreads_books_children.json.gz
```
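
After decompression, each of these files stores one JSON object per line. Before running the later scripts, it can help to inspect a few records and confirm the field names. The following is a minimal sketch using only the standard library; the helper names `iter_json_lines` and `peek` are illustrative and are not part of this repository.

```python
import gzip
import json
from itertools import islice
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line; works for .gz and plain files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path: str, n: int = 3) -> list:
    """Return the first n records without reading the whole file."""
    return list(islice(iter_json_lines(path), n))
```

For example, `peek("goodreads_reviews_children.json", 1)` returns a one-element list whose keys show which fields are available in the review records.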

+ Organize data as follows:

``` bash
├─children_genre
|       ├─goodreads_children_genre.py
|       ├─__init__.py
|       ├─raw
|       |  ├─goodreads_books_children.json
|       |  ├─goodreads_book_genres_initial.json
|       |  └─goodreads_reviews_children.json
```

### Get edge_text LLM embedding

+ Get the edge_text LLM embeddings through OpenAI and save them in `children_genre/raw` (**please modify this code to match your data**).

```bash
python get_openAI_embedding.py
```
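
When adapting the embedding step to your own data, the main thing to preserve is the order of the texts, so that row `i` of the saved embeddings still corresponds to edge `i`. The sketch below shows one way to batch texts while keeping that alignment. It is not the code in `get_openAI_embedding.py`: `embed_in_batches` and the `embed_batch` callable are hypothetical, and the actual API call is left to whatever client you use.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts batch by batch, returning one vector per text in input order.

    embed_batch is a hypothetical callable that wraps the embedding API and
    returns one vector for each text it receives, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        # Guard against silent misalignment between texts and vectors.
        if len(result) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for local testing: [length, number of spaces]."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]
```

Swapping `fake_embed` for a function that calls a real embedding service leaves the batching and alignment logic unchanged.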

### Load data into nx.Graph

+ Load the data into an `nx.Graph()` and save the graph in `children_genre/raw`.

```bash
python nx_graph_loader.py
```
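
As a rough illustration of what this step produces, the sketch below builds a small `nx.Graph` with the two edge kinds described in the data introduction. It is not the code in `nx_graph_loader.py`. The field names (`user_id`, `book_id`, `review_text`, `genres`), the `descriptions` mapping, the tuple node keys, and the `edge_type` and `text` attribute names are all assumptions made for this example.

```python
import pickle

import networkx as nx


def build_graph(reviews, book_genres, descriptions):
    """Build an undirected graph with user-book and book-genre edges.

    reviews: iterable of dicts with user_id, book_id, review_text (assumed names).
    book_genres: iterable of dicts with book_id and a genres mapping (assumed names).
    descriptions: dict mapping book_id to the book description text (assumed).
    """
    graph = nx.Graph()
    for review in reviews:
        user = ("user", review["user_id"])
        book = ("book", review["book_id"])
        # user-review-book: the review text is stored on the edge.
        graph.add_edge(user, book, edge_type="review", text=review["review_text"])
    for entry in book_genres:
        book = ("book", entry["book_id"])
        text = descriptions.get(entry["book_id"], "")
        for genre in entry["genres"]:
            # book-description-genre: the book description is stored on the edge.
            graph.add_edge(book, ("genre", genre), edge_type="description", text=text)
    return graph


def save_graph(graph, path):
    """Pickle the graph so later steps can reload it."""
    with open(path, "wb") as handle:
        pickle.dump(graph, handle)
```

Prefixing node keys with their type keeps a user ID and a book ID from colliding if they happen to share the same string.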

### Get path graph document

+ Generate the path graph documents and save the resulting dict in `children_genre/raw`.

```bash
python document_embedding.py
```
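
To give a feel for what a path-based document for a node pair might look like, here is a simplified sketch. It is not the algorithm in `document_embedding.py`: it only enumerates simple paths between two nodes up to a hop limit and writes out the edge text along each path. The adjacency format (`adj[u][v]` holds the edge text), the function names, and the output format are assumptions made for this example.

```python
def simple_paths(adj, source, target, max_hops):
    """Yield simple paths from source to target with at most max_hops edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in sorted(adj.get(node, {})):
            if neighbor not in path:
                stack.append((neighbor, path + [neighbor]))


def path_document(adj, source, target, max_hops=3):
    """Render every path as one line of 'u -[edge text]-> v' steps."""
    paths = sorted(
        simple_paths(adj, source, target, max_hops),
        key=lambda p: (len(p), p),  # shortest paths first, then alphabetical
    )
    lines = []
    for path in paths:
        steps = [f"{u} -[{adj[u][v]}]-> {v}" for u, v in zip(path, path[1:])]
        lines.append("; ".join(steps))
    return "\n".join(lines)


# Toy graph: one user, two books, one genre (all values are made up).
adj = {
    "u1": {"b1": "loved it", "b2": "too long"},
    "b1": {"u1": "loved it", "g1": "picture book"},
    "b2": {"u1": "too long", "g1": "chapter book"},
    "g1": {"b1": "picture book", "b2": "chapter book"},
}
document = path_document(adj, "b1", "b2", max_hops=2)
```

The resulting string can then be passed to an embedding model in the next step, with one document per candidate node pair.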

### Get document LLM embedding

+ Get the path graph document embeddings (**please modify this code to match your data**) and save them in `children_genre/raw`.

```bash
python get_openAI_path_embedding.py
```

### Downstream tasks

+ GNN Link prediction with LLM embedding

```bash
python classifier_link_prediction.py
```

+ GNN Edge classification with LLM embedding

The final file structure is as follows.

```bash
├── goodreads
│   ├── children_genre
│   │   ├── goodreads_children_genre.py
│   │   ├── processed
│   │   │   ├── data.pt
│   │   │   ├── pre_filter.pt
│   │   │   └── pre_transform.pt
│   │   └── raw
│   │       ├── 0.15_train_embeddings.pkl
│   │       ├── discription.npy
│   │       ├── edge_attr_book_genre.npy
│   │       ├── goodreads_book_genres_initial.json
│   │       ├── goodreads_books_children.json
│   │       ├── goodreads_reviews_children.json
│   │       ├── nx_graph.pkl
│   │       └── review.npy
│   ├── classifier_link_prediction.py
│   ├── document_embedding.py
│   ├── get_openAI_embedding.py
│   ├── nx_graph_loader.py
│   └── text_graph.py
└── readme.md
```
