PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video to answer this question. PolyViT consists of a single transformer backbone, modality-specific tokenizers and task-specific output heads. By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused.
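The abstract describes a layout with one shared transformer backbone, modality-specific tokenizers, and task-specific heads. The following is a minimal toy sketch of that parameter-sharing structure; all class names, dimensions, and dataset labels here are illustrative assumptions, not taken from the paper's code release, and simple linear maps stand in for the actual tokenizers and ViT encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Toy linear layer standing in for real modules."""
    def __init__(self, d_in, d_out):
        self.w = rng.normal(scale=0.02, size=(d_in, d_out))
    def __call__(self, x):
        return x @ self.w

class PolyViTSketch:
    """Illustrative sketch of the PolyViT layout: one shared backbone,
    per-modality tokenizers, per-task output heads. Dimensions and task
    names are hypothetical."""
    def __init__(self, d_model=256):
        self.backbone = Linear(d_model, d_model)  # stand-in for the shared ViT encoder
        self.tokenizers = {
            "image": Linear(768, d_model),   # e.g. flattened 16x16 RGB patch -> token
            "video": Linear(1536, d_model),  # e.g. spatio-temporal tubelet -> token
            "audio": Linear(256, d_model),   # e.g. spectrogram patch -> token
        }
        self.heads = {
            "imagenet": Linear(d_model, 1000),
            "kinetics": Linear(d_model, 400),
            "audioset": Linear(d_model, 527),
        }

    def forward(self, x, modality, task):
        tokens = self.tokenizers[modality](x)       # modality-specific parameters
        feats = self.backbone(tokens).mean(axis=0)  # shared parameters (mean-pooled here)
        return self.heads[task](feats)              # task-specific parameters

model = PolyViTSketch()
# One forward pass per modality routes through the same backbone weights.
image_logits = model.forward(rng.normal(size=(196, 768)), "image", "imagenet")
audio_logits = model.forward(rng.normal(size=(100, 256)), "audio", "audioset")
```

Because only the tokenizers and heads are duplicated per task, adding a new task grows the parameter count by far less than training a separate full model, which is the source of the parameter-efficiency claim in the abstract.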
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
First revision:
- Updated the abstract to address reviewer comments.
- In Section 2, updated the related work to mention Perceiver-IO and Unified-IO, and updated the description of MultiModel.
- In Section 4, clarified that our method applies to any first-order gradient-based optimiser, and added detail about gradient accumulation.
- In Section 5, corrected bold-facing in the audio results table.
- In Section 6, added a discussion of limitations to the conclusion.
- Added an additional experiment on upstream co-training with images and text to the appendix.

Camera ready:
- Added links to the code release.
Assigned Action Editor: ~Yanwei_Fu2
Submission Number: 404