DGSM-SCAM-GAT and MMT-ViT: Multimodal and Graph-Based Malware Detection

ZhaoNa; Pan Wei

14 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Desk Rejected Submission, CC BY 4.0

Keywords: Malware Detection, Graph Attention Networks, Dynamic Gating Mechanisms, Multi-Modal Fusion, Transfer Learning

TL;DR: This paper proposes DGSM-SCAM-GAT and MMT-ViT, two complementary models for robust malware detection via graph-based dynamic API analysis and multi-modal static feature fusion, reaching 99.31% and 99.54% accuracy (F1-scores of 99.64% and 99.55%), respectively, and outperforming the CNN-LSTM and Malcse baselines.

Abstract: Malware detection faces substantial challenges in real-time and multi-class settings, because single-modality methods struggle to capture intricate behavioral patterns. To address these limitations, we introduce two complementary models: DGSM-SCAM-GAT and MMT-ViT. DGSM-SCAM-GAT integrates dynamic gating, contextual aggregation mechanisms, and graph attention networks (GAT) to enhance temporal and structural modeling of API call sequences. Trained on a dynamic API call sequence dataset, it attains an accuracy of 99.31% and an F1-score of 99.64%, surpassing CNN-LSTM (accuracy: 98.92%). MMT-ViT employs multi-modal attention mechanisms and a pre-trained ViT architecture to fuse features from assembly instruction sequences, binary grayscale images, and binary wavelet sequence features. Evaluated on a public dataset, it achieves 99.54% accuracy and a 99.55% F1-score, outperforming Malcse (accuracy: 98.94%). Ablation studies validate the contribution of each module, and comparative experiments show that the proposed models outperform state-of-the-art baselines. Together, the two frameworks support robust dynamic and static malware identification; code is available in the supplementary materials.
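
The abstract names graph attention over API call sequences as one ingredient of DGSM-SCAM-GAT. As a rough illustration of that general idea only, the sketch below turns a call trace into a small directed graph and applies one single-head graph attention layer in the standard GAT form. It is not the authors' implementation (that is in the supplementary materials) and it omits the dynamic gating and contextual aggregation modules entirely. The call trace, the weight matrix `W`, the attention vector `a`, and the helper names `build_call_graph` and `gat_layer` are all hypothetical choices made for this example.

```python
import math


def build_call_graph(calls):
    """Nodes are distinct API names; each call attends to itself and its predecessor."""
    nodes = sorted(set(calls))
    index = {name: i for i, name in enumerate(nodes)}
    neighbors = {i: {i} for i in range(len(nodes))}  # self-loops
    for src, dst in zip(calls, calls[1:]):
        neighbors[index[dst]].add(index[src])  # edge src -> dst
    return nodes, neighbors


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention layer (standard formulation):
    e_ij = LeakyReLU(a . [W h_i || W h_j]), alpha_ij = softmax_j(e_ij),
    h_i' = sum_j alpha_ij * W h_j, with j ranging over the neighbors of i.
    """
    z = [matvec(W, h) for h in features]  # projected node features
    out, attn = [], {}
    for i in range(len(features)):
        nbrs = sorted(neighbors[i])
        scores = [leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j])))
                  for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attn[i] = dict(zip(nbrs, alphas))
        out.append([sum(al * z[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(z[i]))])
    return out, attn


# Hypothetical call trace, used only to exercise the sketch.
calls = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "RegSetValueExW"]
nodes, neighbors = build_call_graph(calls)
n = len(nodes)
features = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # one-hot
W = [[0.1 * (r + c + 1) for c in range(n)] for r in range(2)]  # arbitrary 2 x n weights
a = [0.5, -0.5, 0.25, 0.75]  # arbitrary attention vector of length 2 * 2
out, attn = gat_layer(features, neighbors, W, a)
```

Here each node ends up with a 2-dimensional embedding, and `attn[i]` holds the attention weights node `i` assigns to itself and its predecessors. In a trained model `W` and `a` would be learned, and a library layer would typically replace this hand-written loop.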

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5118
