When Pointwise Forecast Errors Are Not Enough: An Empirical Study of Temporal Alignment Metrics for Time Series Forecasting

TMLR Paper8764 Authors

04 May 2026 (modified: 22 May 2026) | Under review for TMLR | CC BY 4.0

Abstract: Mean squared error (MSE) and mean absolute error (MAE) are the standard metrics used to evaluate time series forecasting models. Although these metrics are useful, they compare predictions and ground truth only at fixed timestamps and can miss important failures on rapidly varying series. In particular, a model may obtain a strong MSE or MAE while smoothing sharp peaks, missing deep troughs, shifting ridges in time, or delaying abrupt changes. This paper studies this issue empirically by evaluating five forecasting models (DLinear, PatchTST, TimeMixer, iTransformer, and Chronos-2) using MSE, MAE, Dynamic Time Warping (DTW), and the Temporal Distortion Index (TDI). We compare these metrics on standard forecasting benchmarks and on scientific network telemetry from ESnet, with emphasis on cases where local extrema and short-term temporal structure are important. Our results show that pointwise errors can give an incomplete view of model behavior: some forecasts score well under MSE and MAE while visibly smoothing or shifting peaks and troughs, whereas other forecasts better preserve local structure but receive worse pointwise scores. DTW and TDI help expose these differences by measuring shape similarity and temporal misalignment, respectively. We do not argue that DTW and TDI should replace MSE and MAE or that they are sufficient for every forecasting task. Rather, we show that they are useful diagnostic metrics when the timing and shape of peaks, troughs, and ridges matter.
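To make the contrast between pointwise and alignment-based metrics concrete, the following is a minimal sketch on a synthetic series, not the paper's evaluation code. It assumes a textbook DTW with squared-difference local cost, and `tdi_like` is a hypothetical, simplified stand-in for TDI (mean offset of the warping path from the diagonal, divided by the series length); the paper's exact definitions and normalizations may differ. The toy series, the function names, and the two forecasts are all illustrative assumptions.

```python
import numpy as np

def dtw_with_path(a, b):
    """Textbook O(n*m) DTW with squared-difference local cost.

    Returns the total alignment cost and the optimal warping path
    as a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            acc[i, j] = d + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return float(acc[n, m]), path[::-1]

def tdi_like(path, length):
    """Hypothetical simplified TDI: mean |i - j| along the path, scaled by length."""
    return float(np.mean([abs(i - j) for i, j in path])) / length

# Synthetic ground truth: a single sharp peak at t = 10.
truth = np.zeros(20)
truth[10] = 10.0

# Forecast A smooths the peak away entirely (flat at the series mean).
smoothed = np.full(20, 0.5)

# Forecast B keeps the peak's shape but predicts it two steps late.
shifted = np.zeros(20)
shifted[12] = 10.0

mse_smooth = float(np.mean((truth - smoothed) ** 2))
mse_shift = float(np.mean((truth - shifted) ** 2))
mae_smooth = float(np.mean(np.abs(truth - smoothed)))
mae_shift = float(np.mean(np.abs(truth - shifted)))

dtw_smooth, path_smooth = dtw_with_path(truth, smoothed)
dtw_shift, path_shift = dtw_with_path(truth, shifted)
tdi_smooth = tdi_like(path_smooth, len(truth))
tdi_shift = tdi_like(path_shift, len(truth))

print(f"smoothed: MSE={mse_smooth:.2f} MAE={mae_smooth:.2f} DTW={dtw_smooth:.2f} TDI-like={tdi_smooth:.4f}")
print(f"shifted:  MSE={mse_shift:.2f} MAE={mae_shift:.2f} DTW={dtw_shift:.2f} TDI-like={tdi_shift:.4f}")
```

On this toy series the flat forecast scores better on both MSE and MAE even though it misses the peak completely, while the shifted forecast has a DTW cost of zero (the shape is preserved) and a nonzero alignment offset (the timing is wrong). This is the kind of disagreement between metric families that the abstract describes.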

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Giannis_Nikolentzos1

Submission Number: 8764
