Abstract: Deep learning (DL) techniques have been widely applied to detecting malicious activity in network traffic. However, it is challenging to collect a traffic dataset with enough correct labels, and the generalization ability of DL-based malicious traffic detection systems degrades when they are trained on mislabeled data. Several methods have therefore been proposed to detect malicious traffic from training data with low-quality labels. These methods separate noisy from clean samples based on the divergence of their prediction losses. However, this simple criterion is not effective on traffic data because malicious traffic is both obfuscated and redundant. In this paper, we propose a novel two-stage framework for malicious traffic detection from low-quality training data, consisting mainly of noisy sample filtering and label refinement. First, using the small-loss criterion, we filter most of the noisy samples out of the training data while ensuring that the filtered dataset still covers sufficient clean samples. Next, we introduce a double-constrained similarity rule that provides a comprehensive measure of the similarity between samples, and we use it to construct a topological graph. Finally, we exploit the topological relations extracted from this graph to refine the labels according to a neighbor consistency criterion. We validate the effectiveness of our framework on a real-world malicious traffic dataset, achieving an accuracy of 90% even under 80% symmetric label noise. Additionally, results on the publicly available BoT-IoT dataset demonstrate the adaptability of our framework to Internet of Things (IoT) environments.
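
The two ideas named in the abstract, small-loss filtering followed by neighbor-consistency label refinement, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `small_loss_filter` and `refine_labels`, the `keep_ratio` and `k` parameters, and the toy data are all hypothetical, and plain Euclidean nearest neighbors stand in for the double-constrained similarity rule, which the abstract does not specify.

```python
import numpy as np


def small_loss_filter(losses, keep_ratio):
    """Return indices of the keep_ratio fraction of samples with the smallest loss.

    Assumption: samples with small per-sample loss are more likely to be
    correctly labeled (the small-loss criterion).
    """
    n_keep = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses)[:n_keep]


def refine_labels(features, labels, trusted_idx, k=3, n_classes=2):
    """Relabel every sample by majority vote among its k nearest trusted samples.

    Assumption: Euclidean distance is a stand-in for the paper's similarity
    rule; a trusted sample may count itself as one of its own neighbors.
    """
    trusted_feats = features[trusted_idx]
    trusted_labels = labels[trusted_idx]
    k = min(k, len(trusted_idx))
    refined = labels.copy()
    for i, x in enumerate(features):
        dists = np.linalg.norm(trusted_feats - x, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(trusted_labels[nearest], minlength=n_classes)
        refined[i] = votes.argmax()
    return refined


# Toy data (hypothetical): two well-separated clusters, one flipped label in each.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [5.05, 5.05],
])
noisy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])  # indices 4 and 9 are flipped
# Per-sample losses as a trained model might report them (made up for illustration).
losses = np.array([0.1, 0.1, 0.1, 0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 2.0])

trusted = small_loss_filter(losses, keep_ratio=0.8)
refined = refine_labels(features, noisy_labels, trusted, k=3)
print(refined.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this toy case the two high-loss samples are excluded from the trusted set, and their labels are then corrected by the vote of their nearest trusted neighbors. On real traffic data the losses would come from a trained detector and the neighbor graph from learned representations.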