Shapley Value Based Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression
Abstract: Symbolic regression (SR) on high-dimensional data is a challenging problem that often leads to poor generalization performance. While feature selection can improve the generalization ability and efficiency of learning methods, it remains difficult for genetic programming (GP) on high-dimensional SR. The Shapley value has been used in an additive feature attribution method to attribute the difference between the output of a model and an average baseline to the input features. Owing to its solid game-theoretic principles, the Shapley value can fairly compute the importance of each feature. In this paper, we propose a novel feature selection algorithm based on the Shapley value to select informative features in GP for high-dimensional SR. A set of experiments on ten high-dimensional regression datasets shows that, compared with standard GP, the proposed algorithm achieves better learning and generalization performance on most of the datasets. A further analysis shows that the proposed method evolves more compact models containing highly informative features.
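To make the attribution idea concrete, the following is a minimal sketch of exact Shapley values for a small model, where features outside a coalition are held at a baseline value. It is an illustration only and not the algorithm proposed in the paper: the names `shapley_values` and `toy_model`, the all-zero baseline, and the toy expression are assumptions chosen for the example. Exact enumeration is exponential in the number of features, so it is only practical for very small inputs.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions of model(x) - model(baseline).

    Hypothetical helper for illustration; enumerates all feature coalitions.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` take their real values; the rest stay at the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Assumed stand-in for an evolved GP expression; z[3] is never used.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0, 0.0]  # assumed baseline, e.g. feature means of centred data
phi = shapley_values(toy_model, x, baseline)

# Additivity: the attributions sum to the gap between the output and the baseline output.
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi), gap)
```

In this toy case the unused feature receives an attribution of zero, which is the property a Shapley-based selection step could exploit to discard uninformative features; the attributions also sum to the difference between the model output and the baseline output.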