Statistical interpretation of machine learning-based feature importance scores for biomarker discovery.

Vân Anh Huynh-Thu, Yvan Saeys, Louis Wehenkel, Pierre Geurts

2012 (modified: 09 Nov 2022) | Bioinform. 2012 | Readers: Everyone

Abstract: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpretable by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods can thus miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, by contrast, can potentially highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners.
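
The claim that univariate tests can miss interacting variables can be illustrated with a minimal sketch. The toy XOR data set and the helper `mean_difference` below are assumptions made for illustration only and do not come from the paper. Each variable taken alone has identical class means, so a two-sample test would see no signal, yet the two variables together determine the label exactly.

```python
from statistics import mean

# Toy XOR data set (hypothetical, not from the paper): the label is 1
# exactly when the two binary variables differ.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [x1 ^ x2 for x1, x2 in samples]


def mean_difference(values, labels):
    """Difference between the class-1 and class-0 means of one variable.

    This is the numerator of a two-sample t-statistic, so a value of 0
    means a univariate test finds no signal in that variable.
    """
    class1 = [v for v, y in zip(values, labels) if y == 1]
    class0 = [v for v, y in zip(values, labels) if y == 0]
    return mean(class1) - mean(class0)


x1 = [s[0] for s in samples]
x2 = [s[1] for s in samples]

# Each variable in isolation carries no information about the label.
univariate_x1 = mean_difference(x1, labels)
univariate_x2 = mean_difference(x2, labels)

# Considering both variables jointly recovers every label.
joint_hits = [int((a ^ b) == y) for (a, b), y in zip(samples, labels)]
joint_accuracy = mean(joint_hits)

print(univariate_x1, univariate_x2, joint_accuracy)
```

A multivariate relevance score could rank both variables highly here, but the score alone would not say where to place a significance threshold, which is the interpretability gap the abstract describes.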
