Automation of Vulnerability Information Extraction Using Transformer-Based Language Models

Published: 01 Jan 2022, Last Modified: 18 Jun 2024
Venue: CyberICPS/SECPRE/SPOSE/CPS4CIP/CDT&SECOMAN/EIS/SecAssure@ESORICS 2022
License: CC BY-SA 4.0
Abstract: Identifying and mitigating vulnerabilities as rapidly and extensively as possible is essential for preventing security breaches. Organizations and companies therefore often store vulnerability information, expressed in natural language, and share it with other stakeholders. Disclosing and disseminating this information in a structured, unambiguous format in a timely manner is crucial to preventing security attacks. Many existing automated vulnerability information extraction techniques use rule-based strategies such as pattern matching and part-of-speech tagging, or machine learning models built on Conditional Random Fields (CRFs); there are also hybrid models that integrate NLP and pattern recognition into semi-automated systems. We propose an alternative approach using Transformer models, including BERT, XLNet, RoBERTa, and DistilBERT, which have shown promising performance on many downstream NLP tasks, such as Named Entity Recognition (NER) and co-reference resolution, in end-to-end neural architectures. We fine-tune several BERT-like language representation models on a labeled dataset from vulnerability databases for the task of NER, so that security-related terms and phrases can be extracted automatically from vulnerability descriptions. Our approach extracts complex features from the data without requiring feature selection, eliminating the need for domain-expert knowledge, and it outperforms CRF-based models. It can also detect new information in vulnerabilities whose description patterns differ from those specified by rule-based systems.
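To make the described fine-tuning setup concrete, below is a minimal sketch of fine-tuning a BERT-style model for token classification (NER) on vulnerability-description text, using the Hugging Face transformers and PyTorch libraries. The BIO label set, the example sentence, and its tags are hypothetical placeholders for illustration; the paper's actual dataset and label schema are not given in the abstract.

```python
# Sketch: fine-tune a BERT-like model for NER over vulnerability descriptions.
# Labels, example text, and tags below are hypothetical, not the paper's data.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label scheme for security-related entities.
labels = ["O", "B-SOFTWARE", "I-SOFTWARE", "B-VERSION", "I-VERSION",
          "B-ATTACK", "I-ATTACK"]
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

# Any of the models named above could be swapped in here, e.g.
# "roberta-base", "distilbert-base-cased", or "xlnet-base-cased".
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)

# One toy training example: a word-tokenized description with word-level tags.
words = ["Buffer", "overflow", "in", "ExampleApp", "2.1", "allows",
         "remote", "code", "execution"]
word_tags = ["B-ATTACK", "I-ATTACK", "O", "B-SOFTWARE", "B-VERSION",
             "O", "B-ATTACK", "I-ATTACK", "I-ATTACK"]

# Align word-level labels to subword tokens; only the first subword of each
# word keeps its label, and special tokens get -100 so the loss ignores them.
enc = tokenizer(words, is_split_into_words=True, truncation=True,
                return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    if wid is None:
        aligned.append(-100)                    # [CLS]/[SEP] etc.
    elif wid != prev:
        aligned.append(label2id[word_tags[wid]])
    else:
        aligned.append(-100)                    # continuation subwords
    prev = wid
label_tensor = torch.tensor([aligned])

# A single optimization step; a real run would loop over the labeled corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**enc, labels=label_tensor).loss
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

At inference time, the fine-tuned model assigns a label to each token of an unseen vulnerability description, so entities are recovered without hand-crafted rules or CRF feature templates.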