Malware Family Classification with Explainable BERT (xBERT) Using API Calls

Ruba Kharsa, Fatih Kurugollu, Ashiq Anjum, Abbes Amira, Ahmed Bouridane

Published: 2024, Last Modified: 27 Feb 2026 · BDCAT 2024 · CC BY-SA 4.0
Abstract: Malicious software (malware) is a primary element of many cyber crimes and attacks, causing massive damage and financial losses to organizations. Accordingly, malware detection and classification have become a crucial security field, prompting researchers to develop solutions ranging from signature-based approaches to Artificial Intelligence (AI) models that have shown efficacy in detecting malware. Yet users still have reservations about AI models because their black-box nature makes their decisions ambiguous and opaque. To address this problem, this paper develops explainable AI models that robustly classify malware families. The proposed method treats the API call sequences generated by these families as text to be classified, and is evaluated on two datasets. A weighted training methodology is used to address the dataset imbalance problem. The method then applies an eXplainable AI (XAI) approach, employing the Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) frameworks, to establish an understandable and interpretable relationship between the API call sequences and the decisions of the Bidirectional Encoder Representations from Transformers (BERT) model, which enhances the model's accountability and usability. The results show that the BERT model outperforms its counterparts on F1 score, Balanced Accuracy (BA), and Matthews Correlation Coefficient (MCC).
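The abstract does not detail the weighted training methodology. A common scheme for imbalanced classification, sketched below under that assumption, is inverse-frequency ("balanced") class weighting, where each family's loss weight is w_c = N / (K * n_c) for N total samples, K families, and n_c samples in family c; the resulting weights can be passed to a weighted cross-entropy loss during BERT fine-tuning. The family labels here are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights: w_c = N / (K * n_c).

    Rare malware families receive proportionally larger weights, so
    misclassifying their samples contributes more to the training loss.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Hypothetical imbalanced family distribution (10 samples, 3 families).
labels = ["trojan"] * 6 + ["worm"] * 3 + ["ransomware"] * 1
weights = balanced_class_weights(labels)
# The majority family gets a weight below 1, the rarest well above 1.
```

These weights would typically be supplied as the per-class weight vector of the classifier's cross-entropy loss, so that the weighted loss is balanced across families even when the raw sample counts are not.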