Statistical Test for Feature Selection Pipelines by Selective Inference

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 oral. License: CC BY 4.0
TL;DR: We introduce a statistical test for data analysis pipelines in feature selection problems. It allows valid statistical tests to be developed systematically for any pipeline configuration composed of a set of predefined components.
Abstract: A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we focus on feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without extra implementation costs.
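To make the pipeline structure described in the abstract concrete, here is a minimal sketch of one configuration: imputation, then outlier removal, then feature selection. It is an illustration only, not the si4pipeline API and not the authors' implementation. The function names, the assumption that missing values occur in the response (marked as `None`), the median-absolute-deviation outlier rule, and the threshold of 3 are all hypothetical stand-ins chosen to keep the example dependency-free. The sketch covers only the selection stage; the statistical test itself is not shown.

```python
import statistics


def mean_impute(y):
    """Replace missing responses (None) with the mean of the observed ones."""
    observed = [v for v in y if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in y]


def detect_outliers(y, threshold=3.0):
    """Flag indices whose response is more than threshold * MAD from the median.

    Hypothetical stand-in rule, used here only to keep the sketch short.
    """
    med = statistics.median(y)
    mad = statistics.median([abs(v - med) for v in y])
    if mad == 0:
        return set()
    return {i for i, v in enumerate(y) if abs(v - med) > threshold * mad}


def marginal_screening(X, y, k):
    """Return the indices of the k features with the largest |x_j^T y|."""
    n_features = len(X[0])
    scores = [
        abs(sum(row[j] * yi for row, yi in zip(X, y)))
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def run_pipeline(X, y, k=1):
    """Impute, drop flagged outliers, then select k features."""
    y_imputed = mean_impute(y)
    outliers = detect_outliers(y_imputed)
    keep = [i for i in range(len(y_imputed)) if i not in outliers]
    X_clean = [X[i] for i in keep]
    y_clean = [y_imputed[i] for i in keep]
    selected = marginal_screening(X_clean, y_clean, k)
    return selected, sorted(outliers)


if __name__ == "__main__":
    # Toy data: one missing response and one unusually large response.
    X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
    y = [1.0, 2.1, None, 4.2, 5.0, 30.0]
    selected, outliers = run_pipeline(X, y, k=1)
    print("selected features:", selected)
    print("outlier rows:", outliers)
```

Every step depends on the observed responses, so the selected features are themselves data-driven. That is why a test applied to them has to account for the whole pipeline, which is the role selective inference plays in the proposed framework.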
Lay Summary: In practical data analysis, applying a single algorithm to raw datasets is rarely sufficient. Analysts typically construct complex pipelines that integrate multiple algorithms to extract deeper insights. However, the intricacy of these pipelines often makes it difficult to determine whether the results are genuinely meaningful or merely the result of random fluctuations. To address this challenge, we introduce a statistical testing framework designed to evaluate the reliability of such results. As a proof of concept, we focus on pipelines that perform feature selection while handling missing data and detecting outliers. For these specific scenarios, we have developed customized statistical tests to rigorously assess the significance of the selected features. Our testing methodology provides a quantitative evaluation of feature importance, thereby increasing confidence in the analytical outcomes. Through extensive numerical experiments, we demonstrate that our approach facilitates the development of more robust and reliable analytical pipelines. Furthermore, we provide user-friendly software tools to support the application of these tests across a wide range of data analysis workflows.
Link To Code: https://github.com/Takeuchi-Lab-SI-Group/si4pipeline
Primary Area: Theory->Probabilistic Methods
Keywords: Data Analysis Pipeline, AutoML, Statistical Test, Selective Inference, Missing Value Imputation, Outlier Detection, Feature Selection
Submission Number: 13925