PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Published: 09 Jan 2023, Last Modified: 28 Feb 2023, Accepted by TMLR
Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video to answer this question. PolyViT consists of a single transformer backbone, modality-specific tokenizers and task-specific output heads. By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused.
Code: https://github.com/google-research/scenic/tree/main/scenic/projects/polyvit
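
The parameter-sharing layout described in the abstract (one shared backbone, one tokenizer per modality, one output head per task) can be illustrated with a minimal bookkeeping sketch. This is not the PolyViT implementation from the repository above: the class name `PolyModelSketch`, the task names, and every parameter count are hypothetical toy values chosen only to show why sharing the backbone reduces the total parameter count relative to training one full model per task.

```python
from dataclasses import dataclass, field


@dataclass
class PolyModelSketch:
    """Toy parameter bookkeeping: shared backbone, per-modality tokenizers, per-task heads."""

    backbone_params: int                                # shared by every task
    tokenizer_params: dict                              # modality -> parameter count
    head_params: dict = field(default_factory=dict)     # task -> parameter count
    task_modality: dict = field(default_factory=dict)   # task -> modality

    def add_task(self, task: str, modality: str, head_params: int) -> None:
        """Register a task; it reuses the tokenizer of its modality and gets its own head."""
        if modality not in self.tokenizer_params:
            raise KeyError(f"no tokenizer for modality {modality!r}")
        self.task_modality[task] = modality
        self.head_params[task] = head_params

    def route(self, task: str) -> tuple:
        """Components a batch from `task` passes through, in order."""
        modality = self.task_modality[task]
        return (f"tokenizer[{modality}]", "backbone", f"head[{task}]")

    def total_params(self) -> int:
        """Parameters of the co-trained model: each component is counted once."""
        return (
            self.backbone_params
            + sum(self.tokenizer_params.values())
            + sum(self.head_params.values())
        )

    def single_task_total(self) -> int:
        """Parameters if every task had its own tokenizer, backbone and head."""
        return sum(
            self.backbone_params + self.tokenizer_params[m] + self.head_params[t]
            for t, m in self.task_modality.items()
        )


# Hypothetical toy sizes, not taken from the paper.
model = PolyModelSketch(
    backbone_params=1000,
    tokenizer_params={"image": 10, "video": 20, "audio": 10},
)
model.add_task("image_task", "image", head_params=5)
model.add_task("video_task", "video", head_params=5)
model.add_task("audio_task", "audio", head_params=5)

print(model.route("video_task"))
print(model.total_params(), model.single_task_total())
```

In this toy setup the backbone dominates the parameter count, so the ratio between `single_task_total()` and `total_params()` approaches the number of tasks as heads and tokenizers become small. The 8.3x figure reported in the abstract comes from the authors' actual models and datasets, not from this sketch.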
