Abstract: To avoid potential risks posed by vulnerabilities in third-party packages, security practitioners maintain vulnerability reports in vulnerability databases (e.g., GitHub Advisory) to help developers recognize and deploy vulnerability patches. However, existing work shows that in more than half of the vulnerability reports, the field of vulnerability-affected packages is missing or incorrect. To reduce the manual effort of completing and validating the affected-package field, prior work proposes to identify this information automatically. However, all existing approaches suffer from low accuracy: because their time cost grows linearly with the number of packages under consideration, they rely on relatively small models such as logistic regression and BERT. To address these limitations, we propose VulLibGen, the first framework to explore the use of a large language model (LLM) for directly generating the names of affected packages. VulLibGen conducts supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) to supply domain knowledge to the LLM, and applies a local search technique to ensure that each generated package name is among the names of the packages under consideration. Our evaluation shows that VulLibGen achieves an average accuracy of 0.806 in identifying vulnerable packages across the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go), while the best SOTA ranking approaches achieve only 0.721. Additionally, VulLibGen has provided high value to security practice: we have submitted 28 <vulnerability, affected package> pairs to GitHub Advisory, and 22 of them have been accepted and merged.
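As a rough illustration of the local-search step described in the abstract (a minimal sketch, not the authors' implementation; the candidate list and the similarity metric below are assumptions), an LLM-generated package name can be snapped to the closest name that actually exists in the ecosystem:

```python
import difflib

# Hypothetical candidate list: in practice this would be the full set of
# package names in the ecosystem under consideration (e.g., Maven, PyPI).
KNOWN_PACKAGES = [
    "org.apache.logging.log4j:log4j-core",
    "org.apache.logging.log4j:log4j-api",
    "com.fasterxml.jackson.core:jackson-databind",
]

def local_search(generated_name: str, known_packages=KNOWN_PACKAGES) -> str:
    """Map an LLM-generated package name to an existing one.

    If the generated name already exists, return it unchanged; otherwise
    fall back to the most similar known name. difflib's sequence-matching
    ratio is used here as an assumed similarity metric; the paper's actual
    local search may differ.
    """
    if generated_name in known_packages:
        return generated_name
    matches = difflib.get_close_matches(generated_name, known_packages,
                                        n=1, cutoff=0.0)
    return matches[0]

# Example: a slightly wrong generation is corrected to a real package name.
print(local_search("org.apache.logging:log4j-core"))
# -> org.apache.logging.log4j:log4j-core
```

This sketch only demonstrates the constraint that the final output must be a real package name; the RAG and SFT components supply the domain knowledge that makes the initial generation accurate in the first place.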
Paper Type: long
Research Area: NLP Applications
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English