Feature Scoring using Tree-Based Ensembles for Evolving Data Streams

BigData 2019
Abstract: Assigning scores to individual features is a popular method for estimating the relevance of features in supervised learning. Accurate feature score estimation provides essential insights in sensitive domains, where it is decisive for explaining how features influence a given decision and thus contributes to the interpretability of the model. Learning from streaming data adds several challenges to machine learning tasks, including limited resources and changes in the underlying data distribution (i.e., evolving data streams). In this work, we introduce and analyze methods to efficiently estimate the Mean Decrease in Impurity (MDI) and COVER measures using ensembles of incremental decision trees. To keep the scores current in evolving data streams, we employ tree ensembles that incorporate active drift detection. Experimental results show how MDI and COVER can be used to track feature scores when the importance of features to the ensemble model shifts over time. In addition, we show how the feature scores are affected when the learning problem includes non-negligible verification latency in the arrival of labels. We also present a counter-intuitive experiment on a standard benchmark dataset in which the feature scores correctly reflect the importance of two features to the ensemble model; however, these features are prioritized because of biased split decisions, and removing them improves the model's predictive performance. We conclude that the presented measures help to better understand the impact of features on the ensemble model, but they should be used with caution, as they are limited by biases in the underlying tree-building and ensemble procedures.
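
As a rough, illustrative sketch (not the authors' implementation), the snippet below shows one way split statistics could be accumulated into per-feature MDI and COVER scores as incremental trees grow on a stream; the class name StreamFeatureScorer and all parameter names are hypothetical.

from collections import defaultdict

class StreamFeatureScorer:
    """Illustrative accumulator for MDI and COVER over split events
    observed while incremental trees grow on a data stream.
    (Hypothetical helper, not the paper's implementation.)"""

    def __init__(self, n_features):
        self.n_features = n_features
        self.impurity_decrease = defaultdict(float)  # feature -> summed weighted impurity decrease
        self.cover = defaultdict(float)              # feature -> instances routed through its splits

    def record_split(self, feature, n_node, n_total, parent_impurity,
                     child_impurities, child_counts):
        # Weighted impurity decrease contributed by this split (MDI term).
        children = sum(w / n_node * imp for imp, w in zip(child_impurities, child_counts))
        self.impurity_decrease[feature] += (n_node / n_total) * (parent_impurity - children)
        # COVER term: number of instances that reached the splitting node.
        self.cover[feature] += n_node

    def scores(self):
        # Normalize each measure so the per-feature scores sum to one.
        def normalize(d):
            total = sum(d.values()) or 1.0
            return [d[j] / total for j in range(self.n_features)]
        return normalize(self.impurity_decrease), normalize(self.cover)


# Example: two splits observed while an incremental (Hoeffding-style) tree grows.
scorer = StreamFeatureScorer(n_features=3)
scorer.record_split(feature=0, n_node=200, n_total=200, parent_impurity=0.50,
                    child_impurities=[0.20, 0.30], child_counts=[120, 80])
scorer.record_split(feature=2, n_node=120, n_total=200, parent_impurity=0.20,
                    child_impurities=[0.05, 0.10], child_counts=[70, 50])
mdi, cover = scorer.scores()
print("MDI:", mdi)
print("COVER:", cover)

In an ensemble with active drift detection, such per-feature accumulators would be reset or replaced along with the trees that a drift detector discards, which is what keeps the reported scores aligned with the current concept.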