Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time

Huaxiu Yao; Caroline Choi; Bochuan Cao; Yoonho Lee; Pang Wei Koh; Chelsea Finn

Published: 17 Sept 2022, Last Modified: 20 Apr 2025 | NeurIPS 2022 Datasets and Benchmarks | Readers: Everyone

Keywords: temporal distribution shift, invariant learning, continual learning

Abstract: Distribution shifts occur when the test distribution differs from the training distribution, and can considerably degrade the performance of machine learning models deployed in the real world. While recent works have studied robustness to distribution shifts, distribution shifts arising from the passage of time have the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including drug discovery, patient prognosis, and news classification. On these datasets, we systematically benchmark 13 approaches with various inductive biases. We evaluate methods in domain generalization, continual learning, self-supervised learning, and ensemble learning, which leverage timestamps to extract the common structure of the distribution shifts. We extend several domain-generalization methods to the temporal distribution shift setting by treating windows of time as different domains. Finally, we propose two strategies for evaluating model performance under temporal distribution shifts: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol for the broader machine learning community, while Eval-Stream serves as a complementary benchmark for continual learning approaches. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 20% from in-distribution to out-of-distribution data.
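The ideas in the abstract (time windows as domains, a fixed time split, and a data stream) can be illustrated with a minimal Python sketch. This is not the Wild-Time API: the function names `assign_time_domains`, `fixed_time_split`, and `stream_splits`, the `(timestamp, x, y)` tuple layout, and the `horizon` parameter are all hypothetical, and the stream protocol shown (train on everything seen so far, evaluate on the next few timestamps) is an assumption about how such an evaluation might look rather than the benchmark's exact definition.

```python
def assign_time_domains(timestamps, window_size):
    """Bin timestamps into fixed-width windows; each window index acts as a 'domain'."""
    start = min(timestamps)
    return [(t - start) // window_size for t in timestamps]


def fixed_time_split(examples, split_time):
    """Fixed time split: examples at or before split_time are 'past', the rest 'future'.

    examples is a list of (timestamp, x, y) tuples (hypothetical layout).
    """
    past = [e for e in examples if e[0] <= split_time]
    future = [e for e in examples if e[0] > split_time]
    return past, future


def stream_splits(examples, horizon=1):
    """Data-stream style evaluation (assumed protocol, for illustration only).

    For each timestamp t except the last, yield (t, seen, upcoming), where
    'seen' holds all examples up to and including t and 'upcoming' holds the
    examples from the next `horizon` distinct timestamps.
    """
    times = sorted({e[0] for e in examples})
    for i, t in enumerate(times[:-1]):
        next_times = set(times[i + 1 : i + 1 + horizon])
        seen = [e for e in examples if e[0] <= t]
        upcoming = [e for e in examples if e[0] in next_times]
        yield t, seen, upcoming


# Toy data: one example per year, 2010 through 2015 (made up for illustration).
examples = [(year, f"x{year}", year % 2) for year in range(2010, 2016)]

domains = assign_time_domains([e[0] for e in examples], window_size=2)
past, future = fixed_time_split(examples, split_time=2013)
stream = list(stream_splits(examples, horizon=2))
```

In this sketch a model would be trained on `past` and scored separately on held-out data from the same period (in-distribution) and on `future` (out-of-distribution); the gap between the two scores is the kind of performance drop the abstract refers to.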

Open Credentialized Access: Our benchmark includes the MIMIC dataset, which requires PhysioNet credentialing for use of human subject data.

Author Statement: Yes

Dataset Url: https://github.com/huaxiuyao/Wild-Time

License: The source code of the Wild-Time benchmark is released under the MIT license.

Supplementary Material: pdf

TL;DR: A new benchmark for in-the-wild distribution shift over time

URL: https://github.com/huaxiuyao/Wild-Time

Contribution Process Agreement: Yes

In Person Attendance: Yes

Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/wild-time-a-benchmark-of-in-the-wild/code)
