XGV-BERT: Leveraging contextualized language model and graph neural network for efficient software vulnerability detection
Abstract: With deep learning advancing across many fields, numerous data-driven approaches have been proposed for uncovering software vulnerabilities. Natural language processing has emerged as a powerful tool for bridging the semantic gap between programming languages and natural language, yet a significant disparity between the two remains. In this work, we propose XGV-BERT, a framework that combines the pre-trained CodeBERT model with a graph neural network to detect software vulnerabilities. By jointly training the CodeBERT and graph neural network modules, XGV-BERT benefits both from large-scale pre-training on vast amounts of raw data and from transfer learning, in which representations of the training data are learned through graph convolution. Experimental results show that XGV-BERT substantially improves vulnerability detection accuracy over two existing methods, VulDeePecker and SySeVR. On the VulDeePecker dataset, XGV-BERT achieves an F1-score of 97.5%, compared with 78.3% for VulDeePecker. On the SySeVR dataset, XGV-BERT achieves an F1-score of 95.5%, compared with 83.5% for SySeVR.
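To make the graph-convolution step concrete, the following is a minimal sketch of a single propagation layer. The abstract does not state which graph-convolution variant XGV-BERT uses, so the symmetric-normalized form ReLU(D^-1/2 (A + I) D^-1/2 H W) is an assumption here. The tiny feature vectors stand in for CodeBERT embeddings, and all names (`gcn_layer`, `matmul`, `adj`, `feats`, `weight`) are hypothetical.

```python
import math

def matmul(a, b):
    # Plain dense matrix product on lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step (assumed variant, not confirmed by the paper)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Symmetric normalization by node degree (degree >= 1 thanks to self-loops).
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    a_norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
              for i in range(n)]
    # Aggregate neighbor features, apply the linear map, then ReLU.
    out = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy graph: three nodes on a path 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Placeholder node features; in the framework these would be learned embeddings.
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 0.0]]
# Identity weight matrix keeps the example easy to follow.
weight = [[1.0, 0.0],
          [0.0, 1.0]]

hidden = gcn_layer(adj, feats, weight)
for row in hidden:
    print([round(v, 4) for v in row])
```

In a jointly trained setup, `feats` would be produced by the language model and `weight` would be updated together with it by backpropagation. This sketch covers only the forward pass of one layer.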