PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video to answer this question. PolyViT consists of a single transformer backbone, modality-specific tokenizers and task-specific output heads. By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused.
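The abstract describes a layout with one shared transformer backbone, modality-specific tokenizers, and task-specific heads. The following is a minimal toy sketch of that parameter-sharing structure; all class names, dimensions, and dataset labels here are illustrative assumptions, not taken from the paper's code release, and simple linear maps stand in for the actual tokenizers and ViT encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Toy linear layer standing in for real modules."""
    def __init__(self, d_in, d_out):
        self.w = rng.normal(scale=0.02, size=(d_in, d_out))
    def __call__(self, x):
        return x @ self.w

class PolyViTSketch:
    """Illustrative sketch of the PolyViT layout: one shared backbone,
    per-modality tokenizers, per-task output heads. Dimensions and task
    names are hypothetical."""
    def __init__(self, d_model=256):
        self.backbone = Linear(d_model, d_model)  # stand-in for the shared ViT encoder
        self.tokenizers = {
            "image": Linear(768, d_model),   # e.g. flattened 16x16 RGB patch -> token
            "video": Linear(1536, d_model),  # e.g. spatio-temporal tubelet -> token
            "audio": Linear(256, d_model),   # e.g. spectrogram patch -> token
        }
        self.heads = {
            "imagenet": Linear(d_model, 1000),
            "kinetics": Linear(d_model, 400),
            "audioset": Linear(d_model, 527),
        }

    def forward(self, x, modality, task):
        tokens = self.tokenizers[modality](x)       # modality-specific parameters
        feats = self.backbone(tokens).mean(axis=0)  # shared parameters (mean-pooled here)
        return self.heads[task](feats)              # task-specific parameters

model = PolyViTSketch()
# One forward pass per modality routes through the same backbone weights.
image_logits = model.forward(rng.normal(size=(196, 768)), "image", "imagenet")
audio_logits = model.forward(rng.normal(size=(100, 256)), "audio", "audioset")
```

Because only the tokenizers and heads are duplicated per task, adding a new task grows the parameter count by far less than training a separate full model, which is the source of the parameter-efficiency claim in the abstract.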
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
First revision:
- Updated the abstract to address reviewer comments.
- In Section 2, updated the related work to mention Perceiver-IO and Unified-IO, and updated the description of MultiModel.
- In Section 4, clarified that our method applies to any first-order gradient-based optimiser, and added detail about gradient accumulation.
- In Section 5, corrected bold-facing in the audio results table.
- In Section 6, added a discussion of limitations to the conclusion.
- Added an additional experiment on upstream co-training with images and text to the appendix.

Camera ready:
- Added links to the code release.
Assigned Action Editor: ~Yanwei_Fu2
Submission Number: 404