Abstract: Analyzing malware based on API call sequences is an effective approach, as these sequences reflect the dynamic execution behavior of malware. Recent advances in deep learning have made it possible to mine valuable information from API call sequences. However, existing deep learning methods typically operate on raw sequences and may not effectively capture crucial information, especially in the case of multi-process malware, due to the API call interleaving problem. Furthermore, they often fail to capture contextual behaviors within or across processes, which are particularly important for identifying and classifying malicious activities. Motivated by this, we present API2Vec++, a graph-based API embedding method for malware detection and classification. First, we construct a graph model to represent the raw sequence. Specifically, we design the Temporal Process Graph (TPG) to model inter-process behaviors and the Temporal API Property Graph (TAPG) to model intra-process behaviors. Compared to our previous graph model, the TAPG model uses node properties to expose operations and their associated behaviors within a process, and thus enhances detection and classification abilities. Using these graphs, we develop a heuristic random walk algorithm to generate numerous paths that can capture fine-grained malicious familial behavior. By pre-training a BERT model on these paths, we generate embeddings of paths and APIs, which can then be used for malware detection and classification. Experiments on a real-world malware dataset demonstrate that API2Vec++ outperforms state-of-the-art embedding methods and detection/classification methods in both accuracy and robustness, particularly for multi-process malware.