Self-Training of Cyber-Threat Classification Model With Threat-Payload Centric Augmentation

Jae-Yeol Kim, Hyuk-Yoon Kwon

Published: 2024, Last Modified: 19 Feb 2025. IEEE Trans. Ind. Informatics 2024. License: CC BY-SA 4.0

Abstract: Deep learning (DL)-based threat classification has been investigated as a way to analyze threat events effectively and reduce the human resources required in security operation centers (SOCs). However, human labeling (HL) by SOC security analysts is still necessary for accurate classification of, and response to, unknown threat events and new threat trends. This labeling process consumes significant time and effort, which limits the construction of an efficient SOC response system, especially when newly generated large-scale threats demand an immediate response. To address this, we propose PLC-TPA, a new self-training method for threat classification models. We present a self-training pipeline based on pseudo-labeling with confidence (PLC) that automatically labels newly captured threats. To resolve the class imbalance that arises during self-training, we present a novel threat-payload centric augmentation (TPA) method that takes threat-payload characteristics into account. Through extensive experiments, we show that PLC-TPA achieves high threat-classification accuracy, with F1-scores of 0.973 to 0.988, an improvement of 10.9% to 13.4% over other self-training methods. Notably, PLC-TPA performs comparably even to HL while offering significantly faster response times. These findings suggest that the proposed PLC-TPA can substantially improve DL-based SOC environments. PLC-TPA also outperforms the existing methods by 8.3% to 17.4% in comparative experiments.
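
To illustrate the general idea behind pseudo-labeling with confidence, the following is a minimal Python sketch of a generic confidence-threshold selection step. It is not the authors' implementation: the abstract does not specify how confidence is computed or thresholded, so the function name `select_pseudo_labels`, the 0.95 threshold, and the example probabilities are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Return (sample_index, predicted_class, confidence) for each
    unlabeled sample whose top class probability meets the threshold."""
    selected = []
    for i, row in enumerate(probs):
        confidence = max(row)                # top class probability
        label = list(row).index(confidence)  # index of the top class
        if confidence >= threshold:
            selected.append((i, label, confidence))
    return selected


# Hypothetical class probabilities for three unlabeled threat events.
example_probs = [
    [0.98, 0.01, 0.01],  # confident: kept with pseudo-label 0
    [0.40, 0.35, 0.25],  # uncertain: left unlabeled
    [0.02, 0.01, 0.97],  # confident: kept with pseudo-label 2
]
pseudo_labeled = select_pseudo_labels(example_probs, threshold=0.95)
```

In a self-training loop of this kind, the selected samples would typically be added to the training set with their pseudo-labels before the classifier is retrained, while low-confidence samples stay unlabeled for a later round.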