# GraWalkER: Vulnerability Detection via Code Semantic Fusion Graph with Edge-Aware Random Walk

This program provides the implementation of our paper "GraWalkER: Vulnerability Detection via Code Semantic Fusion Graph with Edge-Aware Random Walk". In this paper, we propose GraWalkER, a novel framework that integrates a multidimensional Code Semantic Fusion Graph (CSFG) and Edge-aware Random Walk with Unifying Memory (ERUM) for code vulnerability detection.

# Overview 

In this repository, you will find a Python implementation of GraWalkER. As described in our paper, GraWalkER formulates the vulnerability detection task as a graph classification problem. First, we use Joern to transform the source code of each function into a base code property graph (CPG). Then, through node filtering and sequence connection operations, we obtain the initial CSFG representation. Next, we use a pre-trained word2vec model to embed the node-level code tokens in the CSFG and combine the embeddings with node type attributes to derive the node feature representations. We then apply ERUM, which generates random walks terminating at each node and merges topological and semantic graph features according to the node characteristics and edge correspondences in the CSFG. Finally, we use an MLP classifier to perform code defect detection.
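To make the idea of "random walks terminating at each node" concrete, here is a minimal sketch using only the standard library. It is not the ERUM implementation from the paper: the function names, the toy graph, and the edge-type labels (`AST`, `CFG`, `DDG`, `SEQ`) are assumptions made for illustration. The sketch grows each walk backwards along incoming edges and records the edge type of every step, so that a downstream encoder could consume both the nodes and the edges.

```python
import random
from collections import defaultdict


def build_reverse_adjacency(edges):
    """Map each node to its incoming (source, edge_type) pairs."""
    incoming = defaultdict(list)
    for src, dst, etype in edges:
        incoming[dst].append((src, etype))
    return incoming


def walk_ending_at(node, incoming, length, rng):
    """Sample a walk of at most `length` edges that terminates at `node`.

    The walk is grown backwards through incoming edges and then reversed,
    so the result reads first_node -> ... -> node. etypes[i] is the type
    of the edge from nodes[i] to nodes[i + 1].
    """
    nodes, etypes = [node], []
    current = node
    for _ in range(length):
        candidates = incoming.get(current)
        if not candidates:
            break  # no predecessor: the walk ends up shorter than `length`
        current, etype = rng.choice(candidates)
        nodes.append(current)
        etypes.append(etype)
    nodes.reverse()
    etypes.reverse()
    return nodes, etypes


# Toy graph (hypothetical): (source, destination, edge_type) triples.
edges = [(0, 1, "AST"), (1, 2, "CFG"), (0, 2, "DDG"), (2, 3, "SEQ")]
incoming = build_reverse_adjacency(edges)
rng = random.Random(0)  # fixed seed so repeated runs sample the same walks
walks = {n: walk_ending_at(n, incoming, length=3, rng=rng) for n in range(4)}

for target, (nodes, etypes) in walks.items():
    print(target, nodes, etypes)
```

Sampling backwards guarantees that every node receives a walk ending at itself, including nodes with no incoming edges, for which the walk is just the node.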

# Setting up the environment

You can set up the environment with the following commands.
```
conda create -n GraWalkER python=3.11
conda activate GraWalkER
pip install torch==2.4.0+cu121 torchvision==0.19.0+cu121 torchaudio==2.4.0+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install gensim==4.3.3
pip install dgl==2.4.0+cu121 -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html
```
In addition, we use Joern 4.0.365 for data preprocessing. If you need to preprocess the data from scratch, make sure that Joern is installed.
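Before preprocessing from scratch, it can help to confirm that the Joern command-line tools are reachable. The sketch below only checks the `PATH`; it does not verify the Joern version. The list of tool names is an assumption: `joern`, `joern-parse`, and `joern-export` are Joern's usual launchers, but which of them the preprocessing scripts actually call is not stated here.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed set of Joern launchers; adjust to what your preprocessing needs.
JOERN_TOOLS = ["joern", "joern-parse", "joern-export"]

missing = missing_tools(JOERN_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Joern command-line tools found.")
```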

# Data preprocess

The raw data must be preprocessed according to the instructions in ./data/preprocess, and the preprocessed data is saved in the ./data/processed directory. The entire preprocessing procedure takes some time, so we provide the already preprocessed versions of the datasets used in the paper.

# Pre-trained Word2vec model

We provide a script for pre-training Word2vec models on the different datasets; see ./utils/train_word2vec.py. After pre-training is complete, set the word2vec_path parameter in the run.sh script to the path of the resulting pre-trained model.
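The pre-trained vectors are used to build node features from code tokens and node types. The sketch below illustrates one plausible way to do this and is not taken from the repository: averaging the token vectors and appending a one-hot node-type vector are assumptions, as are the function name and the node-type labels. A plain dictionary stands in for the trained model so the example runs without gensim; with gensim 4.x, a model's `wv` attribute supports the same `token in wv` and `wv[token]` lookups.

```python
def node_feature(tokens, node_type, vectors, node_types, dim):
    """Average the known token vectors, then append a one-hot node type."""
    known = [vectors[t] for t in tokens if t in vectors]
    if known:
        # Column-wise mean over the token vectors found in the vocabulary.
        mean = [sum(col) / len(known) for col in zip(*known)]
    else:
        mean = [0.0] * dim  # no known token: fall back to a zero vector
    one_hot = [1.0 if t == node_type else 0.0 for t in node_types]
    return mean + one_hot


# Toy stand-ins (hypothetical): a 2-dimensional vocabulary and three node types.
vectors = {"int": [1.0, 0.0], "x": [0.0, 1.0]}
NODE_TYPES = ["IDENTIFIER", "CALL", "LOCAL"]

feature = node_feature(["int", "x", "unseen_token"], "LOCAL", vectors, NODE_TYPES, dim=2)
print(feature)
```

Out-of-vocabulary tokens are skipped here; other choices, such as a dedicated unknown-token vector, are equally reasonable.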

# Training and Evaluation

```
chmod u+x run.sh
./run.sh
```
This command trains the GraWalkER model. For more hyperparameter settings, please refer to the description in the run.py file.