The Power of Minimalism in Long Sequence Time-series Forecasting

23 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Long-term time series forecasting, Transformers, Efficiency
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recently, Transformer-based models have been widely applied to time series forecasting tasks due to their remarkable capability to capture complex interactions within sequential data. However, as the sequence length grows, Transformer-based models suffer from increased memory consumption, overfitting, and deteriorating performance in capturing long-range dependencies. Several recent studies have shown that MLP-based models can outperform advanced Transformer-based models on long-term time series forecasting (LTSF) tasks. Unfortunately, linear mappings often struggle to capture intricate dependencies in multivariate time series. Although modeling each channel independently can alleviate this issue, it significantly increases the computational cost. To address this, we introduce a set of simple yet effective depthwise convolution models, named LTSF-Conv, for LTSF. Specifically, we apply a unique filter to each channel to achieve channel independence, which plays a pivotal role in enhancing overall forecasting performance. Experimental results show that LTSF-Conv models outperform the state-of-the-art Transformer-based and MLP-based models across seven real-world LTSF benchmarks. Surprisingly, a two-layer non-stacked network outperforms the state-of-the-art Transformer model in 91\% of cases while requiring far fewer computing resources. In particular, LTSF-Conv models substantially decrease the average number of trainable parameters (by $\sim$ 12$\times$), maximum memory consumption (by $\sim$ 86$\times$), running time (by $\sim$ 18$\times$), and inference time (by $\sim$ 2$\times$) on the Electricity benchmark.
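The abstract describes a depthwise-convolution design in which each channel gets its own filter to preserve channel independence. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class name, layer sizes, kernel size, and the shared linear projection head are illustrative assumptions and are not the paper's actual LTSF-Conv architecture.

```python
# Minimal sketch of a depthwise-convolution forecaster (illustrative only).
# Hyperparameters (kernel_size, look-back length, horizon) are assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn


class DepthwiseConvForecaster(nn.Module):
    def __init__(self, n_channels: int, lookback: int, horizon: int, kernel_size: int = 25):
        super().__init__()
        # groups=n_channels assigns each channel its own filter (channel independence).
        self.depthwise = nn.Conv1d(
            in_channels=n_channels,
            out_channels=n_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=n_channels,
        )
        # For simplicity, a linear layer (shared across channels) maps the
        # look-back window to the forecast horizon along the time axis.
        self.projection = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, lookback)
        h = self.depthwise(x)          # per-channel temporal filtering
        return self.projection(h)      # (batch, n_channels, horizon)


# Example usage with made-up sizes (e.g., 321 channels as in the Electricity data).
model = DepthwiseConvForecaster(n_channels=321, lookback=336, horizon=96)
y = model(torch.randn(8, 321, 336))    # -> (8, 321, 96)
```

The key design choice is `groups=n_channels`, which makes the convolution depthwise so no parameters are shared across channels in the filtering stage, in contrast to a single linear mapping applied jointly to all channels.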
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6882