Multiple imputation and genetic programming for classification with incomplete data

Cao Truong Tran, Mengjie Zhang, Peter Andreae, Bing Xue

Published: 2017, Last Modified: 02 Oct 2024, GECCO 2017, CC BY-SA 4.0

Abstract: Many industrial and research datasets suffer from the unavoidable problem of missing values. One of the most common approaches to classification with incomplete data is to use an imputation method to fill missing values with plausible values before applying a classification algorithm. Multiple imputation is a powerful approach to estimating missing values, but it is very expensive to apply to each single instance that needs to be classified. Genetic programming (GP) has been widely used to construct classifiers for complete data, but it has seldom been used for incomplete data. This paper proposes an approach that combines multiple imputation and GP to evolve classifiers for incomplete data. The proposed method uses multiple imputation to provide high-quality training data. It also searches for common patterns of missing values and uses GP to build a classifier for each pattern. The proposed method therefore generates a set of classifiers that can directly classify any new incomplete instance without requiring imputation. Experimental results show that the proposed method can be faster than other common methods for classification with incomplete data and can also achieve better classification accuracy.
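The idea of one classifier per missing-value pattern can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid rule stands in for the GP-evolved classifiers, the training matrix is assumed to have already been completed by multiple imputation, and all names and values (missing_pattern, common_patterns, train_centroids, classify, the toy data) are hypothetical.

```python
from collections import Counter

def missing_pattern(row):
    """Tuple of booleans, True where a feature is missing (None)."""
    return tuple(v is None for v in row)

def common_patterns(rows, top_k=3):
    """Most frequent missing-value patterns among the given rows."""
    counts = Counter(missing_pattern(r) for r in rows)
    return [p for p, _ in counts.most_common(top_k)]

def train_centroids(X_complete, y, pattern):
    """Stand-in for a GP classifier (assumption): class centroids computed
    only over the features that are observed under `pattern`."""
    keep = [i for i, miss in enumerate(pattern) if not miss]
    sums, counts = {}, Counter()
    for row, label in zip(X_complete, y):
        acc = sums.setdefault(label, [0.0] * len(keep))
        for j, i in enumerate(keep):
            acc[j] += row[i]
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
    return keep, centroids

def classify(row, models):
    """Pick the model matching the row's pattern; no imputation is done."""
    keep, centroids = models[missing_pattern(row)]
    point = [row[i] for i in keep]
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

# Toy training data, assumed to be already completed by multiple imputation.
X_imputed = [[1.0, 1.0, 0.9], [1.2, 0.8, 1.1], [5.0, 5.2, 4.9], [4.8, 5.0, 5.1]]
y = ["a", "a", "b", "b"]

# Incomplete instances to classify; None marks a missing value.
incoming = [[1.1, None, 1.0], [None, 5.1, 5.0], [4.9, None, 5.0]]

patterns = common_patterns(incoming)
models = {p: train_centroids(X_imputed, y, p) for p in patterns}
predictions = [classify(r, models) for r in incoming]
print(predictions)  # ['a', 'b', 'b']
```

In this sketch the expensive step (imputation) happens only at training time; at prediction time an instance is routed to the classifier built for its pattern and uses only its observed features. An instance whose pattern has no trained model raises a KeyError here; how the paper handles such cases is not stated in the abstract.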