Abstract: Streaming data processing has attracted growing attention and become a key research area in machine learning and data mining. The distribution of real data may evolve over time due to many unforeseen factors (a phenomenon called concept drift), and real data usually has imbalanced cluster/class distributions. Consequently, during streaming data processing, drifts that occur in distributions with fewer data objects are easily masked by the larger distributions. This paper therefore proposes an unsupervised drift detection approach, called Multi-Imbalanced Cluster Discriminator (MICD), to address the more challenging imbalance problem of unlabeled data. MICD first partitions the data into compact clusters and then learns a discriminator for each cluster to detect drift. In this way, MICD can detect the occurrence of drift, locate where the drift occurs, and quantify its extent. MICD is efficient, interpretable, and has easy-to-set parameters. Extensive experiments on synthetic and real datasets illustrate the superiority of MICD.