Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large $p$

Shengbin Ye; Meng Li

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0

TL;DR: PAN+SR is a scalable symbolic regression framework that uses nonparametric variable selection to efficiently handle high-dimensional datasets, improving performance across multiple SR methods.

Abstract: Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables, a setting referred to as extreme-scale SR that is common in modern scientific applications. This "large $p$" setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 19 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.
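
To make the pre-screening idea concrete, the following is a minimal sketch of the general "screen first, then run SR" pattern. It is not the PAN+SR implementation: the abstract does not say which nonparametric selector is used, so a simple binned correlation ratio stands in for it here. The names `correlation_ratio` and `prescreen`, the bin count, and the toy data are all assumptions made for illustration.

```python
import math
import random

def correlation_ratio(x, y, n_bins=8):
    """Share of Var(y) explained by bin-wise means of y along x, in [0, 1].

    Stand-in nonparametric relevance score (an assumption, not the
    selector used by PAN+SR). Bins hold roughly equal numbers of points.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: x[i])
    y_mean = sum(y) / n
    total = sum((v - y_mean) ** 2 for v in y)
    if total == 0.0:
        return 0.0
    between = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        bin_mean = sum(y[i] for i in idx) / len(idx)
        between += len(idx) * (bin_mean - y_mean) ** 2
    return between / total

def prescreen(X, y, keep):
    """Return the sorted indices of the `keep` highest-scoring columns."""
    p = len(X[0])
    scores = [correlation_ratio([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical toy problem: 50 inputs, only x0 and x3 drive the response.
rng = random.Random(0)
n, p = 400, 50
X = [[rng.uniform(-2, 2) for _ in range(p)] for _ in range(n)]
y = [math.sin(2 * row[0]) + row[3] ** 2 + rng.gauss(0, 0.1) for row in X]

selected = prescreen(X, y, keep=2)
# Only the retained columns would be handed to a downstream SR method.
X_reduced = [[row[j] for j in selected] for row in X]
```

The reduced matrix `X_reduced` is what an SR method would then search over, so its search space shrinks from 50 candidate inputs to 2. Note that a marginal score like this one can miss variables that matter only through interactions, so a practical selector would likely need to model the inputs jointly.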

Lay Summary: Scientists often want to understand how different factors relate to each other by finding clear, math-based rules in their data. Symbolic regression (SR) is a technique that does exactly this—it searches for equations that explain patterns, which can lead to new scientific insights. But SR struggles when there are too many variables, which is common in fields like biology, physics, or climate science. Too many inputs make the search very slow and the resulting equations hard to understand. Our method, called PAN+SR, helps SR focus on just the most important variables before trying to find equations. It does this using a model-free filtering step that’s flexible and avoids strong assumptions. We also built a new set of benchmark problems that reflect the messy, high-dimensional data real scientists deal with. PAN+SR improves the performance of many existing SR tools and helps them find better, simpler equations more quickly. This makes it easier for researchers to use symbolic regression in real-world science, where both accuracy and interpretability matter.

Link To Code: https://github.com/mattsheng/PAN_SR

Primary Area: General Machine Learning->Supervised Learning

Keywords: Symbolic regression, Nonparametric variable selection, Extreme-scale datasets

Submission Number: 8524
