Abstract: Malware poses a persistent threat to network security. Machine learning models are widely used for malware detection and classification because of their detection efficiency and generalization ability. However, malware datasets are typically class-imbalanced, which degrades the accuracy of models trained on them. To address this challenge, this paper introduces a novel data augmentation method for malware classification, named Family Similarity-Enhanced Data Augmentation (FSDA) (code is available at https://github.com/jasonqzs/FSDA), which uses family similarity to perform implicit data augmentation. FSDA introduces relevant family-specific features into tail classes, enabling effective classification of malware belonging to such families. Specifically, the method first trains the model's backbone and classifier on the long-tailed data. It then estimates the covariance matrix of each class and constructs a knowledge graph that captures the relationship between every pair of classes. Finally, it adaptively enriches tail samples by propagating information from all similar classes within the knowledge graph. Experiments on two widely used datasets, Malimg and MaleVis, demonstrate the efficacy of the proposed FSDA method.
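The pipeline summarized above (per-class covariance estimation, a class-similarity graph, and propagation to tail classes) can be illustrated with a minimal NumPy sketch. The abstract does not state how similarity is measured or how information is propagated, so the sketch makes its own assumptions: cosine similarity between class feature means serves as the graph weight, and a tail class's covariance is blended with a similarity-weighted average of all class covariances. The function names and the `tail_threshold` and `alpha` parameters are hypothetical and are not taken from the FSDA code.

```python
import numpy as np

def class_statistics(features, labels, num_classes):
    """Per-class mean, covariance, and sample count of backbone features."""
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    covs = np.zeros((num_classes, dim, dim))
    counts = np.zeros(num_classes, dtype=int)
    for c in range(num_classes):
        x = features[labels == c]
        counts[c] = len(x)
        if len(x) > 0:
            means[c] = x.mean(axis=0)
        if len(x) > 1:  # covariance needs at least two samples
            covs[c] = np.cov(x, rowvar=False)
    return means, covs, counts

def similarity_graph(means):
    """Assumption: edge weight = cosine similarity of class means.
    Self-loops are removed and negative similarities are clipped to zero."""
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    unit = means / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return np.clip(sim, 0.0, None)

def enrich_tail_covariances(covs, counts, graph, tail_threshold, alpha=0.5):
    """Assumption: a class with fewer than `tail_threshold` samples is a tail
    class; its covariance is mixed with a similarity-weighted average of the
    other classes' covariances, with mixing weight `alpha`."""
    enriched = covs.copy()
    for c in np.flatnonzero(counts < tail_threshold):
        weights = graph[c]
        total = weights.sum()
        if total == 0:
            continue  # no similar class to borrow from
        borrowed = np.tensordot(weights / total, covs, axes=1)
        enriched[c] = (1 - alpha) * covs[c] + alpha * borrowed
    return enriched
```

In an implicit augmentation scheme, the enriched covariances would typically enter the training loss rather than be used to draw explicit samples; that step depends on details the abstract does not give and is omitted here.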