Keywords: Time Series Foundation Model, Forecasting, Benchmark
Abstract: Recent work on foundation models for time series forecasting has been accompanied by benchmarks such as GIFT-Eval, which aim to standardize comparison and establish leaderboards. These studies typically include simple baselines such as Seasonal Naïve or DLinear, setting a low bar that new foundation models are expected to surpass. However, we show that this bar can be substantially raised: with careful tuning, a vanilla linear regression model achieves surprisingly strong performance, outperforming many deep learning methods (e.g., iTransformer) and even popular foundation models such as Chronos Base. This finding highlights the need to recalibrate evaluation practices in time series forecasting, both by adopting stronger baselines that meaningfully challenge foundation models and by incorporating more diverse, non-linear datasets. We argue that linear regression can serve as a litmus test for benchmark design, revealing that current evaluation practices may obscure progress in foundation model forecasting.
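For intuition, a linear forecasting baseline of the kind the abstract describes can be as simple as a multi-output linear map from a window of past observations to the forecast horizon. The sketch below is illustrative only, not the paper's configuration: the windowing scheme, ridge regularization, context length, horizon, and synthetic series are all placeholder assumptions.

```python
# Minimal sketch (assumed setup, not the authors' exact method):
# regress the next `horizon` values on the last `context` lagged values.
import numpy as np
from sklearn.linear_model import Ridge

def make_windows(series: np.ndarray, context: int, horizon: int):
    """Slice a 1-D series into (lagged inputs, future targets) pairs."""
    X, y = [], []
    for t in range(context, len(series) - horizon + 1):
        X.append(series[t - context:t])
        y.append(series[t:t + horizon])
    return np.asarray(X), np.asarray(y)

# Hypothetical data; context/horizon/alpha are illustrative choices.
rng = np.random.default_rng(0)
series = np.sin(np.arange(1000) * 2 * np.pi / 24) + 0.1 * rng.standard_normal(1000)
context, horizon = 96, 24

train = series[:800]
X_train, y_train = make_windows(train, context, horizon)

model = Ridge(alpha=1.0)       # linear regression with light regularization
model.fit(X_train, y_train)    # one linear map: R^context -> R^horizon

last_window = series[800 - context:800].reshape(1, -1)
forecast = model.predict(last_window)[0]
print(forecast.shape)          # (24,) -- the next `horizon` steps
```

In such a setup, the "careful tuning" mentioned in the abstract would plausibly correspond to choices like the context length, regularization strength, and input normalization; these are assumptions on our part rather than details taken from the paper.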
Submission Number: 39