Missing Value Imputation with MERCS: A Faster Alternative to MissForest

Elia Van Wolputte; Hendrik Blockeel

Published: 01 Jan 2020, Last Modified: 25 Jan 2025. Venue: DS 2020. License: CC BY-SA 4.0

Abstract: Fundamentally, many problems in Machine Learning are understood as some form of function approximation; given a dataset \(\mathcal{D}\), learn a function \(f_{\boldsymbol{X} \rightarrow \boldsymbol{Y}}\). However, this overlooks the ubiquitous problem of missing data. E.g., if afterwards an unseen instance has missing input variables, we actually need a function \(f_{\boldsymbol{X}' \rightarrow \boldsymbol{Y}}\) with \(\boldsymbol{X}' \subset \boldsymbol{X}\) to predict its label. Strategies to deal with missing data come in three kinds: naive, probabilistic and iterative. The naive case replaces missing values with a fixed value (e.g. the mean), then uses \(f_{\boldsymbol{X} \rightarrow \boldsymbol{Y}}\) as if nothing was ever missing. The probabilistic case has a generative model \(\mathcal{M}\) of \(\mathcal{D}\) and uses probabilistic inference to find the most likely value of \(\boldsymbol{Y}\), given values for any subset of \(\boldsymbol{X}\). The iterative approach consists of a loop: according to some model \(\mathcal{M}\), fill in all the missing values based on the given ones, retrain \(\mathcal{M}\) on the completed data and redo your predictions, until these converge. MissForest is a well-known realization of this idea using Random Forests. In this work, we establish the connection between MissForest and MERCS (a multi-directional generalization of Random Forests). We go on to show that under certain (realistic) conditions where the retraining step in MissForest becomes a bottleneck, MERCS (which is trained only once) offers at-par predictive performance at a fraction of the time cost.
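
The iterative strategy described in the abstract can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not MissForest or MERCS: the data has exactly two numeric columns, and a one-variable least-squares line stands in for the model \(\mathcal{M}\) where MissForest would use a Random Forest. The names `fit_line` and `iterative_impute` are hypothetical and do not come from the paper or any library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit y = a*x + b. Assumption: a stand-in for the model M."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean of ys
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def iterative_impute(rows, max_iter=50, tol=1e-9):
    """Impute None cells in a list of two-column rows by iterating to convergence."""
    missing = {(i, j) for i, r in enumerate(rows) for j in (0, 1) if r[j] is None}
    # Naive start: fill every missing cell with the observed column mean.
    col_mean = [mean(r[j] for r in rows if r[j] is not None) for j in (0, 1)]
    data = [[col_mean[j] if r[j] is None else r[j] for j in (0, 1)] for r in rows]

    for _ in range(max_iter):
        change = 0.0
        for j in (0, 1):
            k = 1 - j  # the other column acts as the predictor
            # Retrain: use rows where column j was observed; the predictor
            # column may contain values imputed in an earlier pass.
            train = [r for i, r in enumerate(data) if (i, j) not in missing]
            a, b = fit_line([r[k] for r in train], [r[j] for r in train])
            # Redo the predictions for the cells that were missing in column j.
            for i, jj in missing:
                if jj == j:
                    new_value = a * data[i][k] + b
                    change += (new_value - data[i][j]) ** 2
                    data[i][j] = new_value
        if change < tol:  # imputations stopped moving: converged
            break
    return data


# Toy data following y = 2x + 1, with one missing value in each column.
rows = [[1, 3], [2, 5], [3, None], [None, 9], [6, 13]]
completed = iterative_impute(rows)
```

On this toy data the two imputed cells start at their column means and should move toward the values implied by the linear relation (about 7 for the missing y and about 4 for the missing x). The refit inside every pass of the loop is the retraining step that the abstract identifies as a potential bottleneck; a model trained only once would skip it.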
