Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time

25 May 2022, 07:17 (modified: 15 Oct 2022, 20:10), NeurIPS 2022 Datasets and Benchmarks, Readers: Everyone
Keywords: temporal distribution shift, invariant learning, continual learning
TL;DR: A new benchmark for in-the-wild distribution shift over time
Abstract: Distribution shifts occur when the test distribution differs from the training distribution, and they can considerably degrade the performance of machine learning models deployed in the real world. While recent work has studied robustness to distribution shifts, shifts arising from the passage of time carry the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including drug discovery, patient prognosis, and news classification. On these datasets, we systematically benchmark 13 approaches with various inductive biases. We evaluate methods from domain generalization, continual learning, self-supervised learning, and ensemble learning, which leverage timestamps to extract the common structure of the distribution shifts. We extend several domain-generalization methods to the temporal distribution shift setting by treating windows of time as different domains. Finally, we propose two strategies for evaluating model performance under temporal distribution shifts: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol for the broader machine learning community, while Eval-Stream serves as a complementary benchmark for continual learning approaches. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 20% from in-distribution to out-of-distribution data.
Supplementary Material: pdf
URL: https://github.com/huaxiuyao/Wild-Time
Open Credentialized Access: Our benchmark includes the MIMIC dataset, which requires PhysioNet credentialing for use of human subject data.
Dataset Url: https://github.com/huaxiuyao/Wild-Time
License: The source code of the Wild-Time benchmark is released under the MIT license.
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: Yes
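
The abstract mentions two ideas that a small example may make concrete: treating windows of time as domains, and evaluating with a fixed time split (Eval-Fix). The following is a minimal sketch only, not the Wild-Time API. The function names, the two-year window width, the example years, and the split year are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def assign_time_domains(timestamps: Sequence[int], window: int) -> List[int]:
    """Bucket timestamps into fixed-width windows and return one domain id each.

    Hypothetical helper: the window width is an illustrative assumption.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    start = min(timestamps)
    # Integer division groups consecutive timestamps into the same domain.
    return [(t - start) // window for t in timestamps]


def eval_fix_split(
    timestamps: Sequence[int], split_time: int
) -> Tuple[List[int], List[int]]:
    """Split example indices at a fixed time.

    Indices with timestamp <= split_time go to the training side; later
    ones form the out-of-distribution test side. Hypothetical helper.
    """
    train = [i for i, t in enumerate(timestamps) if t <= split_time]
    test = [i for i, t in enumerate(timestamps) if t > split_time]
    return train, test


# Illustrative data: one example per year (assumed, not from the benchmark).
years = [2012, 2013, 2014, 2015, 2016, 2017, 2018]

domains = assign_time_domains(years, window=2)
train_idx, test_idx = eval_fix_split(years, split_time=2015)

print(domains)    # [0, 0, 1, 1, 2, 2, 3]
print(train_idx)  # [0, 1, 2, 3]
print(test_idx)   # [4, 5, 6]
```

With domain ids assigned this way, a domain-generalization method can consume time windows as if they were ordinary domains, while the fixed split keeps all test timestamps strictly after the training period.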