Abstract: In this paper, we propose a framework for fine-tuning vision transformer models to learn attention in distributed settings, where the computational nodes communicate through a peer-to-peer network. The nodes are not allowed to share their private training datasets, but they can share some model parameters with neighboring nodes. We discuss how the proposed framework helps each node acquire global understanding and attention using only its local dataset. We address the problem of parameter-efficient fine-tuning of large transformer models for downstream learning tasks and demonstrate that our proposed framework enables each computational node to achieve performance comparable to that of a single central device with access to the entire training dataset. We present fine-tuning results for ViT, DeiT, and Swin Transformer models on a variety of datasets, and we also show their attention maps to provide insights into the distributed learning process of transformers.
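The abstract describes nodes that keep their training data private while exchanging a subset of model parameters with peer-to-peer neighbors. The paper's actual protocol is not given here, so the following is only a hypothetical sketch of that communication pattern: one synchronous gossip round in which each node averages a designated shared parameter (standing in for the fine-tuned attention weights) with its neighbors, while non-shared parameters and all data stay local. All names (`gossip_round`, the parameter keys, the topology) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of peer-to-peer parameter sharing: each node averages a
# designated subset of its parameters (e.g. fine-tuned attention weights) with
# its neighbors; private data and non-shared parameters never leave the node.
# Scalars stand in for tensors to keep the example self-contained.

def gossip_round(params, graph, shared_keys):
    """One synchronous gossip round.

    params: {node: {param_name: value}}   -- each node's local parameters
    graph:  {node: [neighbor, ...]}       -- the peer-to-peer topology
    shared_keys: names of parameters the nodes are allowed to exchange
    """
    new_params = {}
    for node, local in params.items():
        updated = dict(local)  # non-shared parameters are left untouched
        for key in shared_keys:
            # average the node's own value with its neighbors' values
            vals = [local[key]] + [params[nbr][key] for nbr in graph[node]]
            updated[key] = sum(vals) / len(vals)
        new_params[node] = updated
    return new_params

# Three fully connected nodes; "attn" is shared, "head" stays local.
params = {
    0: {"attn": 0.0, "head": 1.0},
    1: {"attn": 3.0, "head": 2.0},
    2: {"attn": 6.0, "head": 3.0},
}
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
params = gossip_round(params, graph, {"attn"})
# After one round every node's shared "attn" equals the global mean (3.0),
# while each node's private "head" parameter is unchanged.
```

On a sparser topology (e.g. a ring), repeated rounds would be needed for the shared parameters to converge toward consensus, which is consistent with the abstract's claim that each node gradually acquires a global view from purely local data.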
External IDs: dblp:conf/eusipco/QureshiK25