Attackers are not Stealthy: Statistical Analysis of the Well-Known and Infamous KDD Network Security Dataset

João Vitor Valle Silva, Martin Andreoni Lopez, Diogo M. F. Mattos

Published: 2020, Last Modified: 30 Sept 2024. Venue: CIoT 2020. License: CC BY-SA 4.0

Abstract: Anomaly-based approaches for detecting network intrusions are difficult to evaluate, compare, and deploy accurately because adequate datasets are scarce. Consequently, researchers resort to suboptimal datasets, such as the DARPA'98 dataset and its variants KDD'99 and NSL-KDD, that no longer reflect real-world networks or provide insight into current network issues. In this article, we present a statistical study of the NSL-KDD features and conclude that NSL-KDD and the older KDD'99 should not be used as benchmarks for developing novel anomaly-based intrusion detection systems: their features are over-correlated, which introduces biased classification. The proposed approach analyzes the correlation among features instead of checking for redundant values or data imbalance. Our results align with the performance of three machine learning techniques trained to discriminate attack traffic from normal traffic. We show that biased classification occurs because of a high correlation between features and classes. The syntactically-generated features are statistically different between the normal and attack traffic classes, which implies that, in KDD-related datasets, attackers are not stealthy.
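The kind of correlation check the abstract describes can be illustrated with a minimal sketch. The abstract does not say which statistics the authors used, so the Pearson coefficient, the 0.9 threshold, the helper names `pearson` and `correlation_report`, and the toy columns `f_a`, `f_b`, `f_c` are all assumptions made for illustration. The data is randomly generated and is not NSL-KDD.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return cov / (sx * sy)


def correlation_report(columns, labels, threshold=0.9):
    """Return (feature pairs with |r| >= threshold, feature-to-class correlations).

    `columns` maps a feature name to its list of values; `labels` holds 0 (normal)
    or 1 (attack). With a binary label, Pearson's r is the point-biserial correlation.
    """
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    class_corr = {name: pearson(columns[name], labels) for name in names}
    return pairs, class_corr


# Hypothetical toy data (NOT NSL-KDD): f_a tracks the class label closely,
# f_b is a near-duplicate of f_a, and f_c is independent noise.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]
f_a = [y + random.gauss(0, 0.1) for y in labels]
f_b = [2.0 * a + random.gauss(0, 0.05) for a in f_a]
f_c = [random.gauss(0, 1) for _ in labels]

columns = {"f_a": f_a, "f_b": f_b, "f_c": f_c}
pairs, class_corr = correlation_report(columns, labels)

for a, b, r in pairs:
    print(f"over-correlated pair: {a} / {b} (r = {r:.3f})")
for name, r in class_corr.items():
    print(f"{name} vs. class: r = {r:.3f}")
```

In this toy setting, a feature that correlates strongly with the label makes the two classes trivially separable, which is the sense in which such attackers would be "not stealthy". Pearson's coefficient captures only linear association, so a real analysis might also use rank-based or distribution-level tests.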