Expression Sampler as a Dynamic Benchmark for Symbolic Regression

Published: 28 Oct 2023, Last Modified: 09 Dec 2023
Venue: NeurIPS2023-AI4Science Poster
Keywords: symbolic regression, benchmark, dataset
Abstract: Equation discovery, the problem of identifying mathematical expressions from data, has seen the emergence of symbolic regression (SR) techniques, aided by benchmarking systems such as SRBench. However, these systems rely on static expressions and datasets, which limits the insight they provide into the circumstances under which SR algorithms succeed or fail. To address this issue, we introduce an open-source method for generating comprehensive SR datasets via random sampling of mathematical expressions. This method enables dynamic expression sampling while controlling for various characteristics related to expression complexity. The method also allows prior information about expression distributions to be used, for example, to simulate the expression distribution of a specific scientific domain. Using this dynamic benchmark, we demonstrate that the overall performance of established SR algorithms decreases as expression complexity increases, and we provide insight into which equation features are best recovered. Our results suggest that most SR algorithms overestimate the number of expression tree nodes and trigonometric functions, and underestimate the number of input variables, relative to the ground truth.
Submission Track: Original Research
Submission Number: 135
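
Illustrative sketch: the abstract describes sampling random expressions under a prior while controlling complexity. The Python below is a minimal, hypothetical illustration of that general idea and is not the authors' released method. The operator weights in BINARY and UNARY, the leaf probability p_leaf, the 0.8 variable-versus-constant split, and all function names are assumptions made for this example.

```python
import math
import random

# Hypothetical operator priors (illustrative weights, not taken from the paper).
BINARY = {"+": 3.0, "-": 2.0, "*": 3.0}
UNARY = {"sin": 1.0, "cos": 1.0}


def _weighted_choice(rng, weights):
    """Pick one key of `weights` with probability proportional to its value."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def sample_expression(rng, depth, n_vars, p_leaf=0.3):
    """Sample a nested-tuple expression tree whose depth is at most `depth`."""
    if depth == 0 or rng.random() < p_leaf:
        # Leaf node: an input variable (assumed 80% of the time) or a constant.
        if rng.random() < 0.8:
            return ("var", rng.randrange(n_vars))
        return ("const", round(rng.uniform(-2.0, 2.0), 2))
    total_binary = sum(BINARY.values())
    total_unary = sum(UNARY.values())
    if rng.random() < total_binary / (total_binary + total_unary):
        op = _weighted_choice(rng, BINARY)
        left = sample_expression(rng, depth - 1, n_vars, p_leaf)
        right = sample_expression(rng, depth - 1, n_vars, p_leaf)
        return (op, left, right)
    op = _weighted_choice(rng, UNARY)
    return (op, sample_expression(rng, depth - 1, n_vars, p_leaf))


def is_leaf(tree):
    return tree[0] in ("var", "const")


def count_nodes(tree):
    """Number of nodes in the tree, a simple complexity measure."""
    if is_leaf(tree):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])


def tree_depth(tree):
    """Depth of the tree; a single leaf has depth 0."""
    if is_leaf(tree):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


def evaluate(tree, x):
    """Evaluate the tree at the input vector `x` (a sequence of floats)."""
    kind = tree[0]
    if kind == "var":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    if kind == "sin":
        return math.sin(evaluate(tree[1], x))
    if kind == "cos":
        return math.cos(evaluate(tree[1], x))
    a, b = evaluate(tree[1], x), evaluate(tree[2], x)
    if kind == "+":
        return a + b
    if kind == "-":
        return a - b
    return a * b  # the only remaining operator is "*"


if __name__ == "__main__":
    rng = random.Random(0)
    tree = sample_expression(rng, depth=4, n_vars=2)
    print(tree, count_nodes(tree), evaluate(tree, [0.5, -1.0]))
```

Changing the weights in BINARY and UNARY stands in for the domain-specific prior mentioned in the abstract, and the depth argument is one simple way to bound complexity. A real benchmark would likely control more characteristics than this sketch does.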