Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video to answer this question. PolyViT consists of a single transformer backbone, modality-specific tokenizers and task-specific output heads. By co-training on different tasks of a single modality, we achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model that generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach requires less hyperparameter tuning, as the per-task hyperparameters can be readily reused.
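The parameter-sharing layout described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendition (the released code linked below is implemented in JAX/Scenic, not PyTorch): a single shared transformer encoder carries almost all of the parameters, while lightweight per-modality tokenizers and per-task linear heads remain separate. All class names, patch sizes and task names here are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch (hypothetical, not the released JAX/Scenic implementation)
# of the PolyViT weight-sharing structure: modality-specific tokenizers feed a
# single shared transformer encoder, and each task has its own linear head.
import torch
import torch.nn as nn

class PolyViTSketch(nn.Module):
    def __init__(self, d_model=768, num_layers=12, num_heads=12,
                 task_num_classes=None):
        super().__init__()
        # Modality-specific tokenizers: project flattened patches into the
        # shared token dimension (patch/tubelet extraction is omitted here).
        self.tokenizers = nn.ModuleDict({
            "image": nn.Linear(3 * 16 * 16, d_model),      # 16x16 RGB patches
            "audio": nn.Linear(16 * 16, d_model),           # spectrogram patches
            "video": nn.Linear(3 * 2 * 16 * 16, d_model),   # 2-frame tubelets
        })
        # Single shared transformer backbone: the bulk of the parameters,
        # reused by every modality and task.
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Task-specific output heads, one per dataset (names are placeholders).
        task_num_classes = task_num_classes or {"imagenet": 1000,
                                                "kinetics400": 400}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_model, n) for task, n in task_num_classes.items()})

    def forward(self, patches, modality, task):
        tokens = self.tokenizers[modality](patches)  # (batch, seq, d_model)
        encoded = self.encoder(tokens)               # shared weights for all tasks
        pooled = encoded.mean(dim=1)                 # simple pooling for the sketch
        return self.heads[task](pooled)              # task-specific logits

# Usage: each co-training step samples one task and routes its batch through the
# matching tokenizer and head while reusing the shared encoder.
model = PolyViTSketch()
fake_image_patches = torch.randn(2, 196, 3 * 16 * 16)
logits = model(fake_image_patches, modality="image", task="imagenet")
```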
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: First revision:
- Updated the abstract to address a reviewer comment.
- In Section 2, updated the related work to mention Perceiver-IO and Unified-IO. Also updated the description of MultiModel.
- In Section 4, clarified that our method applies to any first-order gradient-based optimiser. Also added details about gradient accumulation.
- In Section 5, corrected bold-facing in the audio results table.
- In Section 6, added a discussion of limitations to the conclusion.
- Added an additional experiment on upstream co-training with images and text to the appendix.
Camera ready:
- Added links to the code release.
Code: https://github.com/google-research/scenic/tree/main/scenic/projects/polyvit
Assigned Action Editor: ~Yanwei_Fu2
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 404