Temporal Experts Averaging for Large-scale Temporal Domain Generalization

ACL ARR 2025 May Submission 670 Authors

14 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time, by predicting future models. Because predicting the full model is prohibitively expensive in large-scale scenarios, recent TDG works predict only the classifier, but this limits generalization potential by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel TDG framework based on weight averaging that adjusts the entire model to maximize generalization potential while incurring minimal computational overhead when scaling to large datasets and models. Our theoretical analysis of weight averaging for TDG guides two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients, derived by modeling temporal weight trajectories in a principal component subspace and weighting experts by their projected proximity to future domains in that subspace. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to 69\% while being up to 60x more efficient.
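To make the averaging step concrete, below is a minimal sketch assuming each expert's weights were already obtained by fine-tuning the base model on one temporal domain under a weight-change constraint. The PCA dimensionality `k`, the one-step linear extrapolation of the trajectory, and the softmax temperature `tau` are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of TEA-style adaptive weight averaging.
# Assumes: experts are flat 1-D numpy weight vectors (one per temporal domain,
# oldest first), and the number of experts is at least k.
import numpy as np

def tea_average(expert_weights, k=2, tau=1.0):
    """Combine temporal expert weights with proximity-based coefficients.

    Returns a single averaged weight vector aimed at the unseen future domain.
    """
    W = np.stack(expert_weights)               # (T, d) temporal weight trajectory
    X = W - W.mean(axis=0)                     # center the trajectory

    # Principal component subspace of the weight trajectory.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:k].T                           # (T, k) expert projections

    # Linearly extrapolate the trajectory one step ahead in the subspace
    # (an illustrative stand-in for the paper's future-domain projection).
    t = np.arange(len(Z))
    slope, intercept = np.polyfit(t, Z, deg=1) # per-PC slope and intercept
    z_future = slope * len(Z) + intercept      # projected future-domain point

    # Weight experts by proximity to the projected future domain.
    dists = np.linalg.norm(Z - z_future, axis=1)
    alphas = np.exp(-dists / tau)
    alphas /= alphas.sum()

    return alphas @ W                          # adaptive weight average
```

In practice the flat vectors would come from flattening each fine-tuned checkpoint's parameters, and the averaged vector would be reshaped back into the model before evaluation on the future domain.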
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Temporal Domain Generalization; Weight Averaging; Efficient Training;
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 670