All-Task Convergence and Backward Transfer in Federated Domain-Incremental Learning with Partial Participation
Keywords: Federated Domain-Incremental Learning, Continual Learning, Partial Participation, Global Convergence Rate
TL;DR: SPECIAL: a one-line change to FedAvg that provably preserves past tasks and achieves all-task convergence in federated domain-incremental learning with partial participation.
Abstract: Real-world federated systems seldom operate on static data: input distributions drift while privacy rules forbid raw-data sharing. We study this setting as Federated Domain-Incremental Learning (FDIL), where (i) clients are heterogeneous, (ii) tasks arrive sequentially with shifting domains, yet (iii) the label space remains fixed. Two theoretical pillars remain missing for FDIL under realistic deployment: a guarantee of backward knowledge transfer (BKT) and a convergence rate that holds across the sequence of *all* tasks with ***partial participation***. We introduce SPECIAL (Server-Proximal Efficient Continual Aggregation for Learning), a simple, memory-free FDIL algorithm that adds a single server-side ``anchor'' to vanilla FedAvg: in each round, the server nudges the aggregated update of the uniformly sampled participating clients toward the previous global model with a lightweight proximal term. This anchor curbs cumulative drift without replay buffers, synthetic data, or task-specific heads, keeping communication and model size unchanged. Our theory shows that SPECIAL (i) *preserves earlier tasks:* a BKT bound caps any increase in prior-task loss by a drift-controlled term that shrinks with more rounds, local epochs, and participating clients; and (ii) *learns efficiently across all tasks:* the first communication-efficient non-convex convergence rate for FDIL with partial participation, $\mathcal{O}(\sqrt{E/\left(NT\right)})$, with $E$ local epochs, $T$ communication rounds, and $N$ participating clients per round, matching single-task FedAvg while explicitly separating optimization variance from inter-task drift. Experimental results further demonstrate the effectiveness of SPECIAL.
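The server-side anchor described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes the anchor takes the common proximal form $\min_w \|w - \bar{w}\|^2 + \mu \|w - w_{\text{prev}}\|^2$, whose minimizer is a convex combination of the FedAvg mean and the previous global model; the function names and the coefficient `mu` are illustrative choices.

```python
# Hypothetical sketch of SPECIAL's server-side "anchor" on top of FedAvg.
# Assumption: the anchor solves min_w ||w - w_avg||^2 + mu*||w - w_prev||^2,
# whose closed form is the convex combination below. Models are flat
# coordinate lists for simplicity.

def fedavg_step(client_models):
    """Plain FedAvg: coordinate-wise mean over the sampled clients' models."""
    n = len(client_models)
    return [sum(coords) / n for coords in zip(*client_models)]

def special_step(client_models, prev_global, mu=0.1):
    """FedAvg followed by a lightweight proximal pull toward the
    previous global model (the one-line server-side change)."""
    w_avg = fedavg_step(client_models)
    # Closed-form minimizer of ||w - w_avg||^2 + mu*||w - prev_global||^2.
    return [(wa + mu * wp) / (1.0 + mu)
            for wa, wp in zip(w_avg, prev_global)]
```

With `mu = 0` this reduces exactly to FedAvg; larger `mu` pulls the new global model closer to the previous one, curbing inter-task drift at the cost of slower adaptation.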
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 2521