LLMs Struggle to Differentiate Vulnerable Code from Patched Code: An Empirical Study and Knowledge-level Enhancement Framework
Abstract: Although LLMs have shown promising potential in vulnerability detection, this study reveals their limited ability to distinguish vulnerable code from its similar-but-benign patched counterpart (only 0.04-0.06 accuracy). This indicates that LLMs struggle to capture the root causes of vulnerabilities during vulnerability detection.
To address this challenge, we propose enhancing LLMs with multi-dimensional vulnerability knowledge distilled from historical vulnerabilities. We design Vul-RAG, a novel knowledge-level Retrieval-Augmented Generation (RAG) framework, which improves LLM accuracy by 22%-25% in distinguishing vulnerable from patched code. Additionally, the vulnerability knowledge generated by Vul-RAG can (1) serve as high-quality explanations that improve manual detection accuracy (a 17% increase), and (2) help detect 10 previously unknown bugs in a recent Linux kernel release (6 of which have been confirmed by the Linux community).
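To make the knowledge-level RAG idea concrete, the minimal Python sketch below retrieves distilled vulnerability knowledge for a target function and folds it into a detection prompt. Everything here is a hypothetical simplification rather than the paper's implementation: the VulnKnowledge schema (functional semantics, root cause, fix solution), the lexical jaccard retriever, and the example strings are all assumptions; a real system would presumably use embedding-based retrieval and an actual LLM call.

from dataclasses import dataclass

@dataclass
class VulnKnowledge:
    """One distilled knowledge item from a historical vulnerability (hypothetical schema)."""
    functional_semantics: str   # what the vulnerable code is supposed to do
    root_cause: str             # why the code is vulnerable
    fix_solution: str           # how the patch removes the root cause

def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity; a real system would likely use embedding retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(code: str, kb: list[VulnKnowledge], k: int = 2) -> list[VulnKnowledge]:
    """Rank knowledge items by similarity between the target code and their semantics."""
    return sorted(kb, key=lambda item: jaccard(code, item.functional_semantics), reverse=True)[:k]

def build_prompt(code: str, knowledge: list[VulnKnowledge]) -> str:
    """Compose a detection prompt that grounds the LLM in retrieved knowledge,
    asking it to check both the root cause and whether the fix is present."""
    ctx = "\n".join(
        f"- Root cause: {item.root_cause}\n  Fix: {item.fix_solution}" for item in knowledge
    )
    return (
        "Given the following known vulnerability knowledge:\n"
        f"{ctx}\n\n"
        "Does the target code exhibit any of these root causes, "
        "or has the corresponding fix already been applied?\n\n"
        f"Target code:\n{code}"
    )

# Tiny in-memory knowledge base with one illustrative (hypothetical) entry.
kb = [
    VulnKnowledge(
        functional_semantics="iterates over a linked list and frees each node",
        root_cause="node is dereferenced after being freed (use-after-free)",
        fix_solution="save the next pointer before freeing the current node",
    ),
]

target = "loop that frees each node of a linked list"
print(build_prompt(target, retrieve(target, kb)))

Prompting with the retrieved root cause and fix side by side is what lets the model separate vulnerable from patched code: the same knowledge item supports both a positive match (root cause present) and a negative one (fix already applied).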
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding, security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: C/C++
Submission Number: 1382