Random feature baselines provide distributional performance and feature selection benchmarks for clinical and 'omic machine learning
Abstract: Identifying predictive features from high-dimensional datasets is a major task in biomedical research. However, it is difficult to determine the robustness of selected features. Here,
we investigate the performance of randomly
chosen features, which we term “random feature baselines” (RFBs), in the context of disease
risk prediction from blood plasma proteomics
data in the UK Biobank. We examine two published case studies predicting diagnosis of (1)
dementia and (2) hip fracture. RFBs built from the same number of randomly chosen proteins perform
similarly to the published proteins of interest. We
then measure the performance of RFBs for all
607 disease outcomes in the UK Biobank, with
various numbers of randomly chosen features,
as well as all proteins in the dataset. For 114 of the 607
outcomes, the mean AUROC was higher with 5 random features
than with all proteins, and the absolute difference in mean AUROC
was 0.075. For 163 outcomes, the mean AUROC was higher
with 1000 random features
than with all proteins, and the absolute difference in mean AUROC was 0.03. Incorporating
RFBs should become part of ML practice when
feature selection or target discovery is a goal.
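The idea of a random feature baseline can be illustrated with a minimal sketch: draw k features at random many times, score a model on held-out data for each draw, and compare a selected panel against the resulting AUROC distribution. This is not the authors' pipeline. The function names (`auroc`, `mean_difference_scores`, `random_feature_baseline`, `fraction_at_least`), the default number of draws, and the simple mean-difference scoring model are all assumptions chosen to keep the example dependency-free.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """Rank-based AUROC: P(case score > control score), ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_scores(X_train, y_train, X_test, cols):
    """Stand-in model (an assumption, not the paper's): weight each chosen
    feature by its case mean minus control mean in the training data."""
    weights = []
    for j in cols:
        cases = [row[j] for row, y in zip(X_train, y_train) if y == 1]
        controls = [row[j] for row, y in zip(X_train, y_train) if y == 0]
        weights.append(mean(cases) - mean(controls))
    return [sum(w * row[j] for w, j in zip(weights, cols)) for row in X_test]


def random_feature_baseline(X_train, y_train, X_test, y_test, k, n_draws=100, seed=0):
    """Return the held-out AUROC for each of n_draws random k-feature subsets."""
    rng = random.Random(seed)
    n_features = len(X_train[0])
    results = []
    for _ in range(n_draws):
        cols = rng.sample(range(n_features), k)  # k distinct feature indices
        scores = mean_difference_scores(X_train, y_train, X_test, cols)
        results.append(auroc(y_test, scores))
    return results


def fraction_at_least(baseline_aurocs, panel_auroc):
    """Share of random draws whose AUROC matches or beats a selected panel."""
    return sum(a >= panel_auroc for a in baseline_aurocs) / len(baseline_aurocs)
```

Under these assumptions, a selected panel whose AUROC is matched or exceeded by a large fraction of random draws of the same size offers little evidence that its specific features are informative.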