Abstract: Segmented regression is a statistical method that approximates a function $f$ by a piecewise function $\hat{f}$ using noisy data samples.
*Min-$\epsilon$* approaches aim to minimize the mean squared error (MSE) of the regression function for a given number $k$ of segments.
An optimal solution for *min-$\epsilon$* segmented regression can be found in $\mathcal{O}(n^2)$ time (Bai & Perron, 1998; Yamamoto & Perron, 2013) for $n$ samples. For large datasets, current heuristics improve the time complexity to $\mathcal{O}(n\log{n})$ (Acharya et al., 2016) but can result in large errors, especially when exactly $k$ segments are used.
We present a method for *min-$\epsilon$* segmented regression that combines the scalability of top existing heuristic solutions with a statistical efficiency similar to the optimal solution. This is achieved by using a new method to merge an initial set of segments using precomputed matrices from samples, allowing both merging and error calculation in constant time.
Using the same samples and the same parameter $k$, our approach produces segments with up to 1,000 times lower MSE than Acharya et al. (2016) in about 100 times less runtime on datasets with more than $10^4$ samples.
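To illustrate why precomputed sample statistics can make merging and error calculation constant-time, here is a minimal sketch for the simplest case of a one-dimensional line fit $y \approx a + bx$. It is not the authors' implementation (see the linked repository for that); the function names `prefix_stats`, `segment_sse`, and `merge_cost` are hypothetical, and the use of prefix sums over the entries of $X^\top X$, $X^\top y$, and $y^\top y$ is an assumption made for illustration.

```python
from itertools import accumulate


def prefix_stats(xs, ys):
    """Prefix sums of the moments needed to fit y ~ a + b*x on any index range.

    Illustrative assumption: these sums are the entries of X^T X, X^T y and
    y^T y for a design matrix with columns [1, x].
    """
    def cum(values):
        # Leading 0.0 so that cum[j] - cum[i] is the sum over samples i..j-1.
        return [0.0] + list(accumulate(values))

    return {
        "x": cum(xs),
        "y": cum(ys),
        "xx": cum(x * x for x in xs),
        "xy": cum(x * y for x, y in zip(xs, ys)),
        "yy": cum(y * y for y in ys),
    }


def segment_sse(stats, i, j):
    """Sum of squared residuals of the least-squares line on samples i..j-1.

    Runs in constant time, independent of the segment length.
    """
    n = j - i
    if n <= 0:
        raise ValueError("segment must contain at least one sample")
    sx = stats["x"][j] - stats["x"][i]
    sy = stats["y"][j] - stats["y"][i]
    sxx = stats["xx"][j] - stats["xx"][i]
    sxy = stats["xy"][j] - stats["xy"][i]
    syy = stats["yy"][j] - stats["yy"][i]
    # Centered second moments of the segment.
    vxx = sxx - sx * sx / n
    vxy = sxy - sx * sy / n
    vyy = syy - sy * sy / n
    if vxx <= 0.0:
        # All x values coincide: the best fit is the mean of y.
        return max(vyy, 0.0)
    # Clamp at zero to absorb floating-point cancellation.
    return max(vyy - vxy * vxy / vxx, 0.0)


def merge_cost(stats, i, m, j):
    """Increase in squared error when segments [i, m) and [m, j) are merged."""
    return segment_sse(stats, i, j) - segment_sse(stats, i, m) - segment_sse(stats, m, j)
```

A greedy merging heuristic could, for example, repeatedly merge the adjacent pair with the smallest `merge_cost` until $k$ segments remain. For long inputs, plain prefix sums can lose precision through cancellation, so a production implementation would likely need a more careful numerical scheme.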
Lay Summary: Regression is a well-known and fundamental technique in machine learning and statistical analysis. Given a dataset of samples, each with a known input value and a correlated measured result, it is possible to derive a function that approximates the relationship between these two variables as closely as possible. This function can be used either to derive knowledge about the underlying data or to predict the output value for unseen input values. In some use cases, the samples are ordered and their behavior changes suddenly at certain points. Correctly detecting these breakpoints is challenging. Current state-of-the-art algorithms either become prohibitively compute-heavy as the number of samples grows or produce a significantly worse regression function.
In this paper, we present and evaluate a new algorithm for segmented regression. In our evaluation, this new approach needed far fewer computational resources, even compared to the other heuristics, while producing regressions very close to the optimal solution, without creating additional breakpoints. Given that more data enables more precise models, we think that our approach enables faster analysis and much more precise models in many different fields, including time-series analysis, ecology, econometrics, and gene analysis.
Link To Code: https://github.com/Loesgar/mvsr/tree/paper-icml-25
Primary Area: General Machine Learning
Keywords: Regression, Segmented Regression, Time-Series Analysis
Submission Number: 16316