Multi-Round Random Subspace Feature Selection for Incomplete Gene Expression Data

Will Pearson, Cao Truong Tran, Mengjie Zhang, Bing Xue

Published: 2019, Last Modified: 02 Oct 2024, CEC 2019, CC BY-SA 4.0

Abstract: Gene expression data has been successfully used for cancer classification. However, gene expression data often contains a large number of missing values, which causes serious problems for classification. A common approach to classification with incomplete data is to use decision trees, which can work directly with missing data. However, decision trees built on gene expression data are often inaccurate because the data has a large number of genes (very high dimensionality) and a small number of samples. Feature selection is a popular way to address this problem. Recently, evolutionary computation techniques such as genetic algorithms (GAs) and particle swarm optimisation (PSO) have been widely used for feature selection. Nonetheless, these evolutionary techniques are often unstable and inaccurate when working with high-dimensional gene expression data. Therefore, this paper proposes a new feature selection method that divides the feature space into subspaces multiple times and then uses evolutionary computation techniques to perform feature selection on these subspaces. Experimental results show that the proposed method not only improves classification accuracy but also selects far fewer and more stable features than other common feature selection methods.
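The multi-round subspace idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the fitness function, the evolutionary operators, or how results from different rounds are combined, so everything below is an assumption made for illustration: the fitness is a hypothetical class-separation score computed with NaN-aware statistics, a simple (1+1) bit-flip search stands in for GA or PSO, and features are kept by a vote threshold across rounds. The names fitness, evolve_subset, multi_round_selection, n_subspaces, and min_votes are all hypothetical.

```python
import numpy as np


def fitness(X, y, mask):
    """Hypothetical fitness: mean class separation of chosen features, minus a size penalty."""
    if not mask.any():
        return -np.inf
    Xs = X[:, mask]
    # nanmean/nanstd skip missing entries, so no imputation is needed here.
    gap = np.abs(np.nanmean(Xs[y == 0], axis=0) - np.nanmean(Xs[y == 1], axis=0))
    spread = np.nanstd(Xs, axis=0) + 1e-9
    return float(np.mean(gap / spread)) - 0.01 * mask.sum()


def evolve_subset(X, y, rng, generations=50):
    """Stand-in (1+1) evolutionary search over a bit mask (assumed; not GA or PSO)."""
    n = X.shape[1]
    best = rng.random(n) < 0.5
    best_fit = fitness(X, y, best)
    for _ in range(generations):
        # Flip each bit independently with probability 1/n.
        child = best ^ (rng.random(n) < 1.0 / n)
        child_fit = fitness(X, y, child)
        if child_fit >= best_fit:
            best, best_fit = child, child_fit
    return best


def multi_round_selection(X, y, n_rounds=5, n_subspaces=4, min_votes=3, seed=0):
    """Repeatedly split features into random subspaces and select within each one."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    votes = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # A fresh random partition of the feature indices in every round.
        perm = rng.permutation(n_features)
        for idx in np.array_split(perm, n_subspaces):
            if idx.size == 0:
                continue
            mask = evolve_subset(X[:, idx], y, rng)
            votes[idx[mask]] += 1
    # Assumed aggregation rule: keep features chosen in at least min_votes rounds.
    return np.flatnonzero(votes >= min_votes), votes


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 20))
    y = np.repeat([0, 1], 20)
    X[y == 1, 0] += 3.0                      # make feature 0 informative
    X[rng.random(X.shape) < 0.1] = np.nan    # inject roughly 10% missing values
    selected, votes = multi_round_selection(X, y)
    print("selected features:", selected)
    print("votes per feature:", votes)
```

Because each feature falls into exactly one subspace per round, its vote count is bounded by n_rounds, and features that are selected consistently across different random partitions accumulate more votes. That is one plausible way to favour stable features, though the paper may combine the rounds differently.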