DistilSR: A Distilled Version of Gene Expression Programming Symbolic Regression

Published: 01 Jan 2023, Last Modified: 28 Sept 2024, GECCO Companion 2023, CC BY-SA 4.0
Abstract: Symbolic Regression (SR) is the task of finding closed-form expressions that describe the relationship between variables in a dataset. Current SR methods tend to neglect a large portion of the search space of 'short' expressions in favor of longer, less explainable expressions. In contrast, we propose to prioritize expression length over prediction performance. We do so by systematically searching the space of 'short' expressions, represented as K-expressions from Gene Expression Programming. However, this search space is large, scaling approximately exponentially with the number of variables in a dataset. To reduce its size, we propose a method, termed DistilSR, which replaces terminal symbols with weighted linear combinations of variables. We show that DistilSR exactly recovers the ground-truth equation of 16 synthetic datasets 100% of the time, outperforming 14 benchmark SR methods in SRBench. On 14 real-world datasets, DistilSR also outperforms 14 benchmark SR algorithms and 4 benchmark non-SR algorithms from SRBench, while producing consistently shorter equations. Finally, to further enforce sparsity of the weights, we propose a method of actively setting uninfluential weights to 0, achieving even shorter expressions with competitive prediction performance.
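To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how a K-expression might be evaluated when every terminal is a weighted linear combination of the input variables. The symbol set, the terminal marker "L", the function names, and the magnitude-threshold rule in `sparsify` are all assumptions made for illustration; the abstract does not specify how weights are fitted or how "uninfluential" weights are identified.

```python
import numpy as np

# Hypothetical symbol set: "L" marks a terminal that stands for a
# weighted linear combination of all input variables (an assumption).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1, "L": 0}
FUNCS = {"+": np.add, "-": np.subtract, "*": np.multiply,
         "/": np.divide, "sin": np.sin}


def decode_k_expression(symbols):
    """Decode a K-expression breadth-first.

    Returns, for each symbol that is actually used, the list of indices
    of its children. Trailing symbols beyond the expression are ignored.
    Assumes the K-expression is valid (enough symbols for every argument).
    """
    children = []
    next_free = 1  # index of the next symbol not yet assigned to a parent
    i = 0
    while i < next_free:
        n = ARITY[symbols[i]]
        children.append(list(range(next_free, next_free + n)))
        next_free += n
        i += 1
    return children


def evaluate(symbols, weights, X):
    """Evaluate the expression on X (n_samples x n_vars).

    weights[k] is the weight vector of the k-th "L" terminal, counted in
    left-to-right order of appearance in the K-expression.
    """
    children = decode_k_expression(symbols)
    leaf_nodes = [i for i in range(len(children)) if symbols[i] == "L"]
    leaf_index = {node: k for k, node in enumerate(leaf_nodes)}
    values = [None] * len(children)
    # Children always have larger indices than their parent, so a
    # reverse sweep evaluates every child before its parent.
    for i in reversed(range(len(children))):
        if symbols[i] == "L":
            values[i] = X @ weights[leaf_index[i]]
        else:
            values[i] = FUNCS[symbols[i]](*(values[c] for c in children[i]))
    return values[0]


def sparsify(weights, tol=1e-3):
    """Set small-magnitude weights to 0 (an assumed stand-in criterion)."""
    return np.where(np.abs(weights) < tol, 0.0, weights)


# Usage: the K-expression "* L L" with two variables encodes
# (w00*x0 + w01*x1) * (w10*x0 + w11*x1).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0, 0.0], [0.0, 2.0]])  # yields x0 * (2 * x1)
y = evaluate(["*", "L", "L"], W, X)
```

With a single terminal type, the number of distinct short K-expressions no longer depends on the number of variables, which is one way to read the search-space reduction the abstract describes; the variables enter only through the weights.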