Malware Detection Using Pseudo Semi-Supervised Learning

Published: 2022, Last Modified: 12 Nov 2025 | ICPRAI (2) 2022 | CC BY-SA 4.0
Abstract: Malware, due to its ever-evolving nature, remains a serious threat. Sophisticated attacks using ransomware and viruses have crippled organizations globally. Traditional heuristic and signature-based methods have failed to keep up and are easily evaded by such programs. Machine learning-based methods can alleviate this concern by detecting inherent and persistent structures in malware that heuristic methods fail to recognize. Supervised learning methods have been used previously, but they need vast labeled datasets. Semi-supervised learning can address this limitation by leveraging insights gained from unlabeled data. In this paper, we present a novel semi-supervised learning framework that can identify malware from sparsely labeled datasets. The framework leverages the global and local features learned by a combination of k-NN and CNN to generate pseudo labels, enabling efficient training on both labeled and unlabeled samples. The combined loss of the models regularizes the neural network. The performance of this framework is compared against popular semi-supervised approaches such as LapSVM, TSVM, and label propagation on the Embers dataset. The proposed framework achieved a detection accuracy of 72% with just 10% of the labeled training samples. These results demonstrate the viability of semi-supervised methods in large-scale malware detection systems.
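The abstract does not state the exact rule used to combine the k-NN and CNN outputs into pseudo labels. As an illustration only, the following minimal sketch assumes one plausible rule: an unlabeled sample receives a pseudo label when the network's predicted class matches the k-NN vote and the network's confidence exceeds a threshold. The function names `knn_predict` and `select_pseudo_labels`, the threshold of 0.8, the toy 2-D features, and the hard-coded `net_probs` array (a stand-in for CNN softmax outputs) are all assumptions made for this example, not details from the paper.

```python
import numpy as np

def knn_predict(X_lab, y_lab, X_unl, k=3):
    """Majority-vote k-NN labels for X_unl using squared Euclidean distance."""
    # Pairwise squared distances, shape (n_unlabeled, n_labeled).
    d2 = ((X_unl[:, None, :] - X_lab[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest labeled points
    votes = y_lab[nearest]                       # their labels, shape (n_unlabeled, k)
    n_classes = int(y_lab.max()) + 1
    # Count votes per class, then take the most frequent class per sample.
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def select_pseudo_labels(knn_labels, net_probs, threshold=0.8):
    """Hypothetical agreement rule: keep samples where the network is
    confident AND its predicted class matches the k-NN vote."""
    net_labels = net_probs.argmax(axis=1)
    confident = net_probs.max(axis=1) >= threshold
    keep = confident & (net_labels == knn_labels)
    return np.flatnonzero(keep), net_labels[keep]

# Toy data: class 0 = benign cluster, class 1 = malicious cluster (illustrative).
X_lab = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unl = np.array([[0.2, 0.3], [5.5, 5.2], [3.2, 3.2]])

# Stand-in for CNN softmax outputs on the unlabeled samples (assumed values).
net_probs = np.array([[0.95, 0.05],
                      [0.10, 0.90],
                      [0.55, 0.45]])

knn_labels = knn_predict(X_lab, y_lab, X_unl, k=3)
idx, pseudo = select_pseudo_labels(knn_labels, net_probs, threshold=0.8)
print("k-NN labels:", knn_labels.tolist())
print("pseudo-labeled indices:", idx.tolist(), "labels:", pseudo.tolist())
```

In this toy run the third unlabeled sample is left without a pseudo label because the stand-in network is not confident about it and disagrees with the k-NN vote. In a training loop, the selected samples would be added to the labeled set for the next round; how the paper weights them in its combined loss is not specified in the abstract.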