# Making Large Language Models Perform Better in Knowledge Graph Completion

> Large language model (LLM) based knowledge graph completion (KGC) aims to predict the missing triples in the KGs with LLMs and enrich the KGs to become better web infrastructure, which can benefit a lot of web-based automatic services. However, research about LLM-based KGC is limited and lacks effective utilization of LLM's inference capabilities, which ignores the important structural information in KGs and prevents LLMs from acquiring accurate factual knowledge. In this paper, we discuss how to incorporate the helpful KG structural information into the LLMs, aiming to achieve structural-aware reasoning in the LLMs. We first transfer the existing LLM paradigms to structural-aware settings and further propose a knowledge prefix adapter (KoPA) to fulfill this stated goal. KoPA employs structural embedding pre-training to capture the structural information of entities and relations in the KG. Then KoPA informs the LLMs of the knowledge prefix adapter which projects the structural embeddings into the textual space and obtains virtual knowledge tokens as a prefix of the input prompt. We conduct comprehensive experiments on these structural-aware LLM-based KGC methods and provide an in-depth analysis comparing how the introduction of structural information would be better for LLM's knowledge reasoning ability.
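
The prefix idea described in the abstract can be illustrated with a toy, dependency-free sketch. It is not the code in `finetune_kopa.py`: the single shared linear projection, the function names, and the tiny dimensions are all assumptions chosen for readability. The sketch maps the structural embeddings of a triple's head, relation, and tail to virtual token vectors and places them in front of the prompt's token embeddings.

```python
import random


def make_projection(struct_dim, hidden_dim, num_prefix, seed=0):
    """Random weights for a linear map: struct_dim -> num_prefix * hidden_dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(struct_dim)]
            for _ in range(num_prefix * hidden_dim)]


def project(weight, embedding, hidden_dim):
    """Map one structural embedding to a list of virtual token vectors."""
    flat = [sum(w * x for w, x in zip(row, embedding)) for row in weight]
    return [flat[i:i + hidden_dim] for i in range(0, len(flat), hidden_dim)]


def build_prefix(weight, triple_embeddings, hidden_dim):
    """Concatenate the virtual tokens of the head, relation, and tail."""
    prefix = []
    for embedding in triple_embeddings:
        prefix.extend(project(weight, embedding, hidden_dim))
    return prefix


# Toy sizes (assumed); real structural and LLM dimensions are far larger.
STRUCT_DIM, HIDDEN_DIM, NUM_PREFIX = 4, 8, 1
weight = make_projection(STRUCT_DIM, HIDDEN_DIM, NUM_PREFIX)
head, relation, tail = [0.1] * STRUCT_DIM, [0.2] * STRUCT_DIM, [0.3] * STRUCT_DIM
prompt_embeddings = [[0.0] * HIDDEN_DIM for _ in range(5)]  # 5 prompt tokens

prefix = build_prefix(weight, [head, relation, tail], HIDDEN_DIM)
model_input = prefix + prompt_embeddings  # virtual knowledge tokens come first
```

With `NUM_PREFIX = 1`, matching the `--num_prefix 1` setting used in the training scripts, each triple contributes three virtual tokens to the start of the input sequence.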

## 🌈 Model Architecture
![Model_architecture](figure/model.png)


## 🔬 Dependencies
Our code is developed based on the open-source project [alpaca-lora](https://github.com/tloen/alpaca-lora). Please set up the Python environment following the instructions in alpaca-lora.

Core Python library versions:
- Python 3.9.16
- torch 2.0.0
- transformers 4.28.0
- peft 0.3.0

To run the experiments, you need at least one GPU with at least 80 GB of memory. In our experiments, we used a Linux server running Ubuntu and equipped with three A800 GPUs, using one GPU for each individual experiment.
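
A quick way to confirm that an environment matches the pinned library versions is sketched below. It uses only the standard library (`importlib.metadata`); the helper names `installed_version` and `find_mismatches` are illustrative and not part of this repository.

```python
from importlib import metadata

# Versions pinned in the list above (the Python version is not checked here).
REQUIRED = {"torch": "2.0.0", "transformers": "4.28.0", "peft": "0.3.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        # Compare on the prefix so local build tags such as "2.0.0+cu118" pass.
        if found is None or not found.startswith(wanted):
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result means all three libraries are installed at the expected versions. Other versions may also work, but they are not the ones listed here.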

## 🌲 Data Preparation
Due to the size of the data, it is shipped as an archive: unzip the data file `data.zip` and place the extracted files in `data/`. **The data file can be downloaded from the supplemental material in OpenReview**.
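
The extraction step can also be scripted, for example as in the sketch below. It assumes that the files sit at the top level of `data.zip`; if the archive already contains a `data/` folder, extract into the repository root instead. The expected file names are taken from the training commands in this README, and the helper names are illustrative.

```python
import zipfile
from pathlib import Path

# Dataset names as they appear in the training commands of this README.
DATASETS = ("UMLS", "CoDeX-S", "FB15K-237N")


def extract_archive(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def missing_training_files(data_dir, datasets=DATASETS):
    """List the files the training scripts expect but that are not present."""
    data_dir = Path(data_dir)
    expected = []
    for name in datasets:
        expected.append(data_dir / f"{name}-train.json")   # training triples
        expected.append(data_dir / f"{name}-rotate.pth")   # pre-trained KG embeddings
    return [str(path) for path in expected if not path.exists()]
```

Calling `extract_archive("data.zip", "data")` followed by `missing_training_files("data")` should return an empty list when the layout matches this assumption.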

## 📕 Training & Test

### Training
- To train the model, run one of the following shell scripts. Note that the `--num_epochs` parameter can be set to 3, 4, or 5; try each to find the best result.

```shell
# For UMLS dataset
export WANDB_DISABLED=true
wandb offline
CUDA_VISIBLE_DEVICES=0 nohup python finetune_kopa.py \
    --base_model 'YOUR LLM PATH' \
    --data_path 'data/UMLS-train.json' \
    --output_dir 'YOUR SAVE PATH' \
    --num_epochs 3 \
    --lora_r 64 \
    --learning_rate 3e-4 \
    --batch_size 12 \
    --micro_batch_size 12 \
    --num_prefix 1 \
    --kge_model 'data/UMLS-rotate.pth' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' > log.txt &
```

```shell
# For CoDeX-S dataset
export WANDB_DISABLED=true
wandb offline
CUDA_VISIBLE_DEVICES=0 nohup python finetune_kopa.py \
    --base_model 'YOUR LLM PATH' \
    --data_path 'data/CoDeX-S-train.json' \
    --output_dir 'YOUR SAVE PATH' \
    --num_epochs 3 \
    --lora_r 64 \
    --learning_rate 3e-4 \
    --batch_size 12 \
    --micro_batch_size 12 \
    --num_prefix 1 \
    --kge_model 'data/CoDeX-S-rotate.pth' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' > log.txt &
```

```shell
# For FB15K-237N dataset
export WANDB_DISABLED=true
wandb offline
CUDA_VISIBLE_DEVICES=0 nohup python finetune_kopa.py \
    --base_model 'YOUR LLM PATH' \
    --data_path 'data/FB15K-237N-train.json' \
    --output_dir 'YOUR SAVE PATH' \
    --num_epochs 3 \
    --lora_r 64 \
    --learning_rate 3e-4 \
    --batch_size 12 \
    --micro_batch_size 12 \
    --num_prefix 1 \
    --kge_model 'data/FB15K-237N-rotate.pth' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' > log.txt &
```


Replace `YOUR LLM PATH` and `YOUR SAVE PATH` with the path to your base LLM and your output directory before running.
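
Since the three scripts differ only in the dataset name, and the epoch count is the one setting suggested for tuning, the commands can be generated rather than copied. The sketch below only builds and prints the commands. The flags and values are those shown in the scripts above, while the helper names and the per-run output directory pattern (`<root>/<dataset>-ep<N>`) are assumptions made for illustration.

```python
import shlex


def build_command(dataset, base_model, output_dir, num_epochs=3):
    """Assemble the finetune_kopa.py arguments used in the scripts above."""
    return [
        "python", "finetune_kopa.py",
        "--base_model", base_model,
        "--data_path", f"data/{dataset}-train.json",
        "--output_dir", output_dir,
        "--num_epochs", str(num_epochs),
        "--lora_r", "64",
        "--learning_rate", "3e-4",
        "--batch_size", "12",
        "--micro_batch_size", "12",
        "--num_prefix", "1",
        "--kge_model", f"data/{dataset}-rotate.pth",
        "--lora_target_modules=[q_proj,k_proj,v_proj,o_proj]",
    ]


def epoch_sweep(dataset, base_model, output_root, epochs=(3, 4, 5)):
    """One command per epoch count, each writing to its own output directory."""
    return [
        build_command(dataset, base_model, f"{output_root}/{dataset}-ep{n}", n)
        for n in epochs
    ]


if __name__ == "__main__":
    # Placeholder paths: replace them with your own model and save locations.
    for command in epoch_sweep("UMLS", "YOUR LLM PATH", "YOUR SAVE PATH"):
        print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run(command, check=True)` with `CUDA_VISIBLE_DEVICES` set in the environment.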

### Inference
```shell
CUDA_VISIBLE_DEVICES=0 python inference_kopa.py
```
Before running the inference code, edit `test_data_path` and `lora_weights` to match the corresponding dataset.
