Real-World Code Vulnerability Detection Framework: From Data Preprocessing to Multi-Feature Fusion Detection
Abstract: Code vulnerability detection (CVD) is critical to ensuring the security, stability, and reliability of software. When exploited by malicious actors, code vulnerabilities can lead to severe consequences and significant losses. However, real-world CVD remains unsatisfactory in practice, and several issues still need to be resolved: the poor quality of real-world CVD datasets, the high resource consumption of code intermediate structures, the imbalance between positive and negative samples, and insufficient feature modeling of code vulnerabilities. To address these challenges, we propose MARCOVul, a comprehensive and efficient framework for real-world CVD. It covers the complete process from data preprocessing to final vulnerability detection, optimizing the entire real-world CVD pipeline. Our approach begins with a data derivation technique that seeks to improve the overall quality of the dataset. Next, we propose a code-specific data augmentation method to tackle sample imbalance in the dataset. We then propose a code intermediate structure simplification method that reduces computational complexity and resource consumption while fully leveraging the power of language models. Finally, we propose a real-world CVD method based on multi-feature fusion to identify potential security vulnerabilities in code. Experiments on a large-scale real-world CVD dataset demonstrate the effectiveness of MARCOVul in detecting real-world code vulnerabilities: it achieves improvements of up to 12.75% in Binary-F1 score (BF1) and 6.98% in Matthews correlation coefficient (MCC) over the best unweighted baselines, and gains of 1.55% in BF1 and 1.42% in MCC over the best weighted ones.
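The abstract reports results in BF1 and MCC without defining them. As a minimal sketch, assuming the standard textbook definitions of binary F1 and the Matthews correlation coefficient over a confusion matrix (with label 1 meaning "vulnerable"), the two metrics can be computed as follows. The function names and the toy labels are hypothetical and purely illustrative; they are not taken from MARCOVul or its dataset.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = vulnerable, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def binary_f1(tp, fp, fn):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy, imbalanced example (invented labels): 2 vulnerable, 6 benign samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
bf1_score = binary_f1(tp, fp, fn)
mcc_score = mcc(tp, tn, fp, fn)
print(f"BF1 = {bf1_score:.4f}, MCC = {mcc_score:.4f}")
```

Unlike plain accuracy, both metrics stay informative when negative samples heavily outnumber positive ones, which is the imbalance setting the abstract describes: BF1 ignores true negatives entirely, while MCC uses all four cells of the confusion matrix.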
External IDs: dblp:journals/tdsc/ZhangDLLHL25