GP-based Feature Selection and Weighted KNN-based Instance Selection for Symbolic Regression with Incomplete Data

Baligh Al-Helali, Qi Chen, Bing Xue, Mengjie Zhang

Published: 2020, Last Modified: 20 Nov 2024. Venue: SSCI 2020. License: CC BY-SA 4.0

Abstract: Data incompleteness is a serious challenge in symbolic regression, particularly when learning from real-world data. Imputation handles this situation by replacing the missing values with estimates. One popular imputation method is based on K-nearest neighbours (KNN). However, imputing unseen data with this method requires access to the training data at application time, which increases computation time, especially for large-scale data sets. To address this issue, this work proposes a combination of genetic programming-based feature selection and KNN-based instance selection for imputing incomplete data efficiently while constructing effective symbolic regression models. Experiments are conducted on real-world data sets with different missingness scenarios, and the obtained results show the empirical soundness of the proposed method compared to the benchmark methods.
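
To make the idea of imputing from a reduced reference set concrete, the following is a minimal sketch of generic inverse-distance weighted KNN imputation. It is not the authors' implementation: the abstract does not specify the weighting scheme, the distance measure, or how features and instances are selected. The function name `weighted_knn_impute`, the `features` argument (standing in for a feature subset such as one chosen by genetic programming), and the idea of passing an already reduced `X_ref` (standing in for selected instances) are all assumptions made for illustration.

```python
import numpy as np

def weighted_knn_impute(X_ref, x, k=3, features=None):
    """Fill NaN entries of x from its k nearest rows in X_ref.

    X_ref    : 2-D array of complete reference instances (assumed to be an
               already selected subset of the training data).
    x        : 1-D array with NaN marking missing values.
    features : column indices used for the distance (assumed to come from
               some feature selection step); defaults to all columns.
    """
    X_ref = np.asarray(X_ref, dtype=float)
    x = np.asarray(x, dtype=float).copy()
    missing = np.isnan(x)
    if not missing.any():
        return x

    cols = np.arange(x.size) if features is None else np.asarray(features)
    cols = cols[~missing[cols]]  # compare only on observed, selected features
    if cols.size == 0:
        # No usable feature for the distance: fall back to column means.
        x[missing] = X_ref[:, missing].mean(axis=0)
        return x

    # Euclidean distance to every reference instance on the chosen columns.
    dists = np.sqrt(((X_ref[:, cols] - x[cols]) ** 2).sum(axis=1))
    k = min(k, X_ref.shape[0])
    nearest = np.argsort(dists)[:k]

    # Inverse-distance weights (an assumed scheme), normalised to sum to 1.
    weights = 1.0 / (dists[nearest] + 1e-12)
    weights /= weights.sum()

    x[missing] = weights @ X_ref[nearest][:, missing]
    return x
```

Under these assumptions, the cost of imputing one unseen instance grows with the number of rows in `X_ref` and the number of columns in `features`, which is why shrinking both can reduce application-time computation.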