Centroid-Based Learning for Malware Detection and Novel Family Identification

Saranya Vijayakumar; Zifan Wang; Yuhang Yao; Matt Fredrikson

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: societal considerations including fairness, safety, privacy

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: malware; graphs; GNN

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: New malware family detection method that implements a Centroid GNN Model on control flow graphs

Abstract: Detecting out-of-distribution (OOD) data categories while preserving the accuracy of existing classifications is a pressing challenge in many domains. Conventional methods often falter when tasked with generating or identifying new data classes, especially when dealing with graphical data and the problem of graph isomorphism. In this paper, we present a novel approach, the Graph Centroid Model (GCM), which combines Control Flow Graphs (CFGs) with a Graph Neural Network (GNN) to address this challenge effectively. The GCM assigns embeddings produced by a GNN to partitions that support the classification of both known and new classes, even those absent during training. Our approach quantifies the differences between samples in the embedding space, enabling the identification of multiple distinct representations of familiar classes during training while providing a straightforward mechanism for detecting new classes during testing. This not only improves classification accuracy but also offers intuitive visualizations that provide valuable insights. When applied to a benchmark malware dataset (BODMAS), our method reveals structural commonalities among samples from different malware families while effectively discerning new, previously unseen classes based on their distance from learned representatives in the embedding space.
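
The distance-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' GCM: it assumes embeddings are already available as plain NumPy arrays (standing in for GNN outputs), uses a single mean centroid per known class rather than multiple representatives, and relies on a hypothetical, hand-picked distance `threshold` to flag samples as new. The function names `fit_centroids` and `predict_with_novelty` are illustrative only.

```python
import numpy as np


def fit_centroids(embeddings, labels):
    """Return the known class ids and one mean embedding (centroid) per class."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_with_novelty(embeddings, classes, centroids, threshold):
    """Assign each sample to its nearest centroid, or -1 if every centroid is too far."""
    # Pairwise Euclidean distances, shape (n_samples, n_centroids).
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Samples farther than `threshold` (an assumed hyperparameter) from all
    # centroids are treated as belonging to a new, unseen class.
    return np.where(dists.min(axis=1) > threshold, -1, classes[nearest])


# Toy 2-D embeddings standing in for GNN outputs of two known families.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
train_y = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(train_x, train_y)

test_x = np.array([[0.2, 0.4], [10.0, 10.0], [5.0, -20.0]])
preds = predict_with_novelty(test_x, classes, centroids, threshold=3.0)
```

In this toy setup the first two test points fall close to a learned centroid and keep a known label, while the third lies far from both and is marked `-1`. A per-class or learned threshold, and several centroids per family, would be natural extensions under the same assumptions.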

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 8386
