Abstract: We study high-dimensional regression with missing entries in the covariates. A common
strategy in practice is to impute the missing entries with an appropriate substitute and then
implement a standard statistical procedure acting as if the covariates were fully observed. Recent
literature on this subject proposes instead to design a specific, often complicated or non-convex,
algorithm tailored to the case of missing covariates. We investigate a simpler approach where we
fill in the missing entries with their conditional mean given the observed covariates. We show
that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO
and square-root LASSO retains the minimax estimation rate in the random-design setting where
the covariates are i.i.d. sub-Gaussian. We further show that the square-root LASSO remains
pivotal in this setting.
It is often the case that the conditional expectation cannot be computed exactly and must be
approximated from data. We study two cases, where the covariates either follow an autoregressive
(AR) process or are jointly Gaussian with a sparse precision matrix. We propose tractable
estimators for the conditional expectation and then perform linear regression via LASSO, and
show similar estimation rates in both cases. We complement our theoretical results with
simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our
bounds, but also the broader utility of this strategy beyond our theoretical assumptions.
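
As a rough illustration of the impute-then-regress strategy described above, the following sketch fills each missing covariate with its conditional mean given the observed ones and then fits a LASSO on the completed design. It is only a sketch under stated assumptions, not the paper's implementation: the covariates are assumed zero-mean Gaussian with a known covariance matrix `Sigma`, so the conditional mean is `Sigma_MO @ inv(Sigma_OO) @ x_O`; the LASSO is solved with a plain proximal-gradient (ISTA) loop; and the function names, the covariance `rho**|j-k|`, the missingness rate, and the choice of `lam` in `demo` are all hypothetical choices made for illustration.

```python
import numpy as np


def conditional_mean_impute(X, mask, Sigma):
    """Replace missing entries (mask == False) by E[x_M | x_O].

    Assumes each row is zero-mean Gaussian with known covariance Sigma
    (an assumption of this sketch, not a general-purpose imputer).
    """
    X_imp = np.where(mask, X, 0.0)  # missing entries start at the (zero) mean
    for i in range(X.shape[0]):
        obs = mask[i]
        mis = ~obs
        if not mis.any() or not obs.any():
            continue  # nothing to impute, or nothing to condition on
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(mis, obs)]
        X_imp[i, mis] = S_mo @ np.linalg.solve(S_oo, X[i, obs])
    return X_imp


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b


def demo(seed=0, n=200, d=50, s=5, rho=0.5, p_missing=0.2, noise=0.5):
    """Hypothetical synthetic experiment: impute, then run the LASSO."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # illustrative covariance
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    beta = np.zeros(d)
    beta[:s] = 1.0
    y = X @ beta + noise * rng.standard_normal(n)

    mask = rng.random((n, d)) > p_missing  # True where the entry is observed
    X_obs = np.where(mask, X, np.nan)      # missing entries are hidden as NaN

    X_imp = conditional_mean_impute(X_obs, mask, Sigma)
    lam = 2.0 * noise * np.sqrt(np.log(d) / n)  # a common heuristic scaling
    beta_hat = lasso_ista(X_imp, y, lam)
    return beta, beta_hat


if __name__ == "__main__":
    beta, beta_hat = demo()
    print("l2 estimation error:", np.linalg.norm(beta_hat - beta))
```

When `Sigma` is unknown, the `conditional_mean_impute` step would need an estimated covariance or precision matrix in its place, which is the situation the second paragraph of the abstract addresses.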