Augmenting Federated Learning with Pretrained Transformers

Published: 28 Oct 2023, Last Modified: 21 Nov 2023, FL@FM-NeurIPS'23 Poster
Keywords: federated learning, pretrained transformer, parameter efficiency, multitask learning
TL;DR: Pretrained transformers with modular updates overcome fundamental federated learning bottlenecks including data & communication efficiency, heterogeneity, and multi-tasking.
Abstract: The explosive growth and diversity of machine learning applications motivate a fundamental rethinking of learning with mobile and edge devices. How can we address *diverse/disparate client goals* and learn with *scarce heterogeneous data*? While federated learning (FL) aims to address these issues, it has several bottlenecks and challenges hindering a unified solution. On the other hand, large transformer models have been shown to work across a variety of tasks, often achieving remarkable few-shot adaptation. This raises the question: Can FL clients use a single general-purpose model -- rather than custom models for each task -- while obeying *device and network constraints*? In this work, we investigate pretrained transformers (PTFs) to achieve these on-device learning goals and thoroughly explore the roles of model size and modularity, where the latter refers to adaptation through modules such as prompts or adapters. We demonstrate that: **(1) Larger scale** shrinks the accuracy gaps between alternative approaches and improves heterogeneity robustness. Crucially, scale allows clients to run *more local SGD epochs*, which substantially ($4\times$) reduces the number of communication rounds. At the extreme, clients can achieve respectable accuracy fully locally, reducing the need for collaboration. **(2) Modularity** enables $>100\times$ less communication in bits. Surprisingly, it also boosts the generalization capability of local adaptation methods and the robustness of smaller PTFs. To explain these benefits, we show that scale and modularity can synergistically mitigate the *representation shift* during FL. Finally, to harness the multitasking capabilities of modern PTFs, we propose FedYolo: a new FL approach that assigns both dedicated and shared modules to FL tasks to manage their interference. Our extensive experiments demonstrate FedYolo's value and the power of scale and modularity for multitasking.
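
To make the modularity idea concrete, the sketch below illustrates one common pattern the abstract alludes to: each client trains only a small adapter on top of a frozen pretrained backbone, and the server averages just those adapter weights. This is a minimal illustrative sketch under assumed settings, not the paper's FedYolo implementation; the names `Adapter`, `client_step`, and `fedavg_adapters`, the layer sizes, and the toy data are hypothetical placeholders.

```python
# Illustrative sketch: federated averaging of a small adapter module
# on top of a frozen pretrained backbone. Not the paper's code.
import copy
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Tiny bottleneck adapter; only these weights are trained and communicated."""
    def __init__(self, dim=128, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual adapter on top of frozen backbone features.
        return x + self.up(torch.relu(self.down(x)))

def client_step(backbone, adapter, head, data, epochs=1, lr=1e-3):
    """Run local SGD on the adapter (and a task head) with the backbone frozen."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(list(adapter.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:
            with torch.no_grad():
                feats = backbone(x)  # frozen pretrained features
            loss = loss_fn(head(adapter(feats)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Only the adapter state is returned; the head is kept local for brevity.
    return adapter.state_dict()

def fedavg_adapters(states):
    """Server-side FedAvg over adapter parameters only."""
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k].float() for s in states]).mean(dim=0)
    return avg

# Toy usage: two clients with random data and a small stand-in backbone.
backbone = nn.Sequential(nn.Linear(32, 128), nn.ReLU())
head = nn.Linear(128, 10)
adapter = Adapter()
clients = [[(torch.randn(8, 32), torch.randint(0, 10, (8,)))] for _ in range(2)]
states = [client_step(backbone, copy.deepcopy(adapter), copy.deepcopy(head), d)
          for d in clients]
adapter.load_state_dict(fedavg_adapters(states))
```

Because only the adapter state dict crosses the network each round, communication scales with the module size rather than the full transformer, which is the source of the $>100\times$ savings in bits reported in the abstract.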
Student Author Indication: Yes
Submission Number: 25