Abstract: Recent research on Internet traffic classification has yielded a
number of data mining techniques for distinguishing types
of traffic, but no systematic analysis of why some algorithms achieve high accuracy. In pursuit of empirically
grounded answers to this "why" question, which is critical for
understanding traffic classification research and giving it a
scientific foundation, this paper reveals the three sources
of discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one or two (for
UDP flows) or four or five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs
best under all circumstances, and we identify the reason: the algorithm discretizes input features during classification. We also find that entropy-based Minimum Description Length discretization of the port and packet
size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much
as 59.8%!) and lets all of them achieve >93% accuracy
on average without any algorithm-specific tuning.
Our results indicate that treating the port and packet
size features as discrete nominal intervals, not as continuous
numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless
of the classification algorithm used.
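
To make the discretization step concrete, the following is a minimal sketch of entropy-based MDL discretization for a single numeric feature such as the size of a flow's first packet. It assumes the criterion intended is the common Fayyad-Irani one (recursive entropy-minimizing splits with an MDL stopping rule); the abstract does not spell out the exact variant. The function names (`mdl_cut_points`, `to_interval`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from bisect import bisect_right
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def _split(xs, ys, cuts):
    """Recursively split sorted values xs (with labels ys), appending cut points."""
    n = len(ys)
    if n < 2:
        return
    ent_s = entropy(ys)
    best = None  # (weighted entropy after the split, split index)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # a cut is only possible between distinct values
        e = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:
        return
    e, i = best
    left, right = ys[:i], ys[i:]
    gain = ent_s - e
    # MDL stopping rule (Fayyad-Irani form, assumed here): accept the cut
    # only if the information gain exceeds the cost of encoding it.
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * ent_s - k1 * entropy(left) - k2 * entropy(right)
    )
    if gain <= (math.log2(n - 1) + delta) / n:
        return
    cuts.append((xs[i - 1] + xs[i]) / 2)
    _split(xs[:i], left, cuts)
    _split(xs[i:], right, cuts)


def mdl_cut_points(values, labels):
    """Return the sorted cut points chosen for one numeric feature."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [label for _, label in pairs]
    cuts = []
    _split(xs, ys, cuts)
    return sorted(cuts)


def to_interval(value, cuts):
    """Map a numeric value to the index of its discrete (nominal) interval."""
    return bisect_right(cuts, value)


# Toy data (hypothetical): first-packet sizes of ten "dns" and ten "http" flows.
sizes = list(range(60, 80, 2)) + list(range(1400, 1420, 2))
apps = ["dns"] * 10 + ["http"] * 10
print(mdl_cut_points(sizes, apps))  # prints [739.0]
```

Each value is then replaced by its interval index (for example `to_interval(64, [739.0])` gives `0`) and passed to the classifier as a nominal attribute rather than a continuous number. Port numbers could be handled the same way, under the same assumptions.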