A More Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models

Kin G. Olivares; Malcolm Wolff; Tatiana Konstantinova; Shankar Ramasubramanian; Boris N. Oreshkin; Andrew Gordon Wilson; Andres Potapczynski; Willa Potosnak; Michael W. Mahoney; Mengfei Cao; Dmitry Efimov

Published: 23 Sept 2025, Last Modified: 19 Nov 2025, BERT2S, CC BY 4.0

Keywords: forecasting, probabilistic forecasting, transfer learning, cross-frequency, benchmarking

TL;DR: Current TSFM benchmarks are flawed; using 15 large-scale, leak-free datasets, we show statistical models still outperform FFMs, though synthetic pre-training narrows the gap.

Abstract: Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on miniature-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failure to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models' accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS and by more than 20% in MASE across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.

Submission Number: 14
