Abstract: Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Because manual labeling is prohibitively expensive, practitioners have turned to automated methods, such as relying on anti-virus engine results from VirusTotal, which unfortunately introduces mislabeling, also known as "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise, which stems from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we frame EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results show that MalCleanse reduces EM label noise by approximately 25.25%, achieving an F1 score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems.
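The idea of pairing uncertainty estimation with unsupervised anomaly detection can be illustrated with a minimal sketch. The abstract does not say which uncertainty estimator or anomaly detector MalCleanse uses, so everything below is an assumption made for illustration: uncertainty is taken as the standard deviation of the predicted malware probability across several stochastic forward passes (for example from dropout or an ensemble), and the anomaly detector is a simple one-dimensional two-means split. The names `uncertainty_scores`, `flag_high_uncertainty`, and the toy `prob_samples` array are hypothetical and are not taken from the paper.

```python
import numpy as np

def uncertainty_scores(prob_samples):
    """Per-sample uncertainty from T stochastic predictions.

    prob_samples has shape (T, N): the predicted malware probability for
    N samples over T passes. The score is the standard deviation across
    passes, an illustrative stand-in for model uncertainty.
    """
    return np.asarray(prob_samples, dtype=float).std(axis=0)

def flag_high_uncertainty(scores, max_iters=50):
    """Unsupervised 1-D two-means split of the uncertainty scores.

    Returns a boolean mask that is True for samples assigned to the
    high-uncertainty cluster, which are treated as suspected label noise.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    high = np.zeros(scores.shape, dtype=bool)
    for _ in range(max_iters):
        # Assign each sample to the nearer of the two cluster centers.
        high = np.abs(scores - hi) < np.abs(scores - lo)
        if high.all() or not high.any():
            break  # degenerate split, e.g. all scores identical
        new_lo, new_hi = scores[~high].mean(), scores[high].mean()
        if new_lo == lo and new_hi == hi:
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return high

# Toy data (made up): 4 stochastic passes over 6 samples. The passes
# disagree strongly only on the sample at index 4.
prob_samples = np.array([
    [0.95, 0.05, 0.90, 0.10, 0.20, 0.92],
    [0.94, 0.06, 0.91, 0.09, 0.80, 0.93],
    [0.96, 0.04, 0.89, 0.11, 0.35, 0.91],
    [0.95, 0.05, 0.90, 0.10, 0.70, 0.92],
])

scores = uncertainty_scores(prob_samples)
flags = flag_high_uncertainty(scores)
print(flags.tolist())
```

In this toy run only the sample whose predictions disagree across passes lands in the high-uncertainty cluster. A real pipeline would then relabel or discard the flagged samples before retraining the detector; how MalCleanse handles that step is not described in the abstract.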
External IDs: dblp:journals/pacmse/LiCZXXW25