Abstract: Training a deep neural network-based intrusion detection system requires a large amount of cleanly labeled data, yet malicious traffic datasets are usually collected from open-source web communities or simulated attack environments and therefore inevitably contain a large portion of unreliably labeled traffic. State-of-the-art methods for handling label noise combine sample separation with semi-supervised learning (SSL); however, they are hardly applicable to the traffic domain because traffic data lacks natural data augmentations comparable to those available for images. To this end, we propose Gedss, a generic label-noise-resistant framework for malicious traffic detection. Unlike previous approaches that focus on data augmentation, ours improves performance by enhancing the quality of sample selection and of the model's decision boundaries. The framework has two parts: sample selection and semi-supervised learning. The sample selection method divides the original traffic instances into clean ones (the labeled set) and noisy ones (the unlabeled set): we fit a Jensen-Shannon divergence-based per-sample prediction loss to a mixture model as the selection criterion, and the selection threshold is adjusted automatically and dynamically, making the mechanism adaptive to diverse malicious traffic datasets. In addition, we design a semi-supervised learning method in which two networks jointly predict pseudo labels for the unlabeled set. To counter the class imbalance of the divided labeled data, we fine-tune the models on all data to improve SSL performance. Extensive experiments under different label-noise scenarios demonstrate that our approach outperforms state-of-the-art methods.
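The mixture-model-based split described above can be sketched as follows. This is a hedged illustration of the general recipe (fitting per-sample losses with a two-component Gaussian mixture and treating the low-mean component as "clean"), not the paper's exact Gedss procedure; the fixed threshold `tau` stands in for the automatic, dynamic threshold the abstract describes, and all names here are illustrative.

```python
# Sketch of loss-based sample selection: fit per-sample losses with a
# two-component Gaussian mixture and select samples whose posterior
# probability of belonging to the low-loss ("clean") component exceeds
# a threshold. Assumes sklearn; the real Gedss criterion and threshold
# adaptation may differ.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses, tau=0.5):
    """Divide samples into clean (labeled) and noisy (unlabeled) index sets.

    losses : (N,) array of per-sample losses, e.g. a Jensen-Shannon
             divergence between the predicted distribution and the
             (possibly noisy) one-hot label.
    tau    : posterior threshold; fixed here, dynamic in the paper.
    """
    x = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    clean_comp = int(np.argmin(gmm.means_.ravel()))   # low-loss component
    w_clean = gmm.predict_proba(x)[:, clean_comp]     # P(clean | loss)
    clean_idx = np.where(w_clean > tau)[0]
    noisy_idx = np.where(w_clean <= tau)[0]
    return clean_idx, noisy_idx

# Toy usage: the first 80 samples have low losses (likely clean labels),
# the last 20 have high losses (likely noisy labels).
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.05, 0.02, 80),
                         rng.normal(0.60, 0.10, 20)])
clean_idx, noisy_idx = split_clean_noisy(losses)
```

The clean indices then form the labeled set for SSL, while the noisy indices are stripped of their labels and pseudo-labeled by the two networks.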