Slowing Down the Aging of Learning-Based Malware Detectors With API Knowledge

Published: 01 Jan 2023, Last Modified: 28 Sept 2023; IEEE Trans. Dependable Secur. Comput. 2023; Readers: Everyone
Abstract: Learning-based malware detectors are widely used in practice to safeguard real-world computers. One major challenge is model aging: the effectiveness of these models drops drastically as malware variants keep evolving. To tackle model aging, most existing works label new samples and retrain the aged models. However, such data-perspective methods often incur excessive labeling and retraining costs. In this article, we observe that during evolution, malware samples often preserve similar malicious semantics while switching to new implementations that use semantically equivalent APIs. This observation lets us approach the problem from a different perspective: the feature space. More specifically, if the models can capture the intrinsic semantics of malware variants in the feature space, the aging of learning-based detectors can be slowed down. Based on this insight, we design APIGraph to automatically extract API knowledge from API documentation and incorporate this knowledge into the training of malware detection models. We use APIGraph to enhance 5 state-of-the-art malware detectors, covering both Android and Windows platforms and various learning algorithms. Experiments on large-scale, evolutionary datasets with nearly 340K samples show that APIGraph can slow down the aging of these models by 5.9% to 19.6%, and reduce labeling effort by 33.07% to 96.30% on top of data-perspective methods.
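The feature-space idea can be illustrated with a minimal sketch. The abstract does not describe how APIGraph builds or applies its API knowledge, so everything below is an assumption made for illustration: the `API_CLUSTER` mapping is hand-written (not extracted from documentation), the cluster names are hypothetical, and the helper functions `api_features` and `cluster_features` are not part of APIGraph. The sketch only shows why grouping semantically equivalent APIs can make an evolved variant look like its predecessor to a detector.

```python
from collections import Counter

# Hypothetical, hand-written mapping from API names to semantic clusters.
# In the article this knowledge comes from API documentation; here it is
# hard-coded purely for illustration.
API_CLUSTER = {
    "TelephonyManager.getDeviceId": "read_device_identifier",
    "TelephonyManager.getImei": "read_device_identifier",
    "HttpURLConnection.connect": "open_network_connection",
    "SSLSocketFactory.createSocket": "open_network_connection",
}


def api_features(calls):
    """Count individual API names (features tied to one implementation)."""
    return Counter(calls)


def cluster_features(calls, mapping=API_CLUSTER):
    """Count semantic clusters; APIs missing from the mapping stay as-is."""
    return Counter(mapping.get(call, call) for call in calls)


# Two hypothetical variants with the same behavior but different APIs.
old_variant = ["TelephonyManager.getDeviceId", "HttpURLConnection.connect"]
new_variant = ["TelephonyManager.getImei", "SSLSocketFactory.createSocket"]

# Per-API features differ, so a model trained on the old variant sees
# unfamiliar features in the new one.
print(api_features(old_variant) == api_features(new_variant))

# Cluster-level features coincide, so the learned pattern still applies.
print(cluster_features(old_variant) == cluster_features(new_variant))
```

Under these assumptions the two variants share no per-API features but have identical cluster-level features, which is the intuition behind looking at model aging from the feature space rather than from the data.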