Malware Family Prediction with an Awareness of Label Uncertainty

Joon-Young Paik; Rize Jin

Malware Family Prediction with an Awareness of Label Uncertainty

Joon-Young Paik, Rize Jin

Published: 01 Jan 2024, Last Modified: 14 Oct 2024Comput. J. 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Malware family prediction has been mainly formulated as a multiclass classification to predict one malware family. This approach suffers from label uncertainty, which can mislead malware analysts. To render malware prediction less susceptible to uncertainty, malware family prediction, which entails predicting one or more families, is performed in this study. In this regard, an encoder–decoder malware family prediction model, EnDePMal, with label uncertainty awareness, is proposed. EnDePMal aims to predict all malware families related to samples and preserve their priorities. It comprises a residual neural network-based encoder and a long short-term memory-based decoder with an attention mechanism. The model uses a sequence of malware family names, but not a family name, as a label. Once a visualized malware image is input into EnDePMal, its encoder extracts the important features from the image. Subsequently, its decoder generates family names, where the attention mechanism allows it to focus on relevant features by attending to the encoder’s output. Experimental results show that EnDePMal can predict 77.64% of malware family sequences that preserve their priorities. Moreover, it achieves an accuracy of 93.49% and an F1-score of 0.9282 for malware families with the highest priority, rendering it comparable to the typical multiclass classification model.

Loading