PolyViT: Co-training Vision Transformers on Images, Videos and Audio

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: transformers, multi-task learning, image classification, video, audio, co-training
Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing the majority of its learnable parameters? We present PolyViT, a model trained on images, audio and video, which answers this question. By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on 5 standard video- and audio-classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, and learns representations that generalize across multiple domains. Finally, we show that co-training is simple and practical to implement, as we do not need to tune hyperparameters for each combination of datasets, but can simply adapt those from standard, single-task training.
One-sentence Summary: We present PolyViT, a single ViT model co-trained on three modalities (images, audio and video) and multiple datasets per modality.
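
The abstract describes co-training as sharing most parameters in one transformer trunk, with modality-specific tokenizers and task-specific heads, updated one sampled task per step. The following minimal PyTorch sketch illustrates that idea only; it is not the authors' implementation, and all names (`SharedEncoderModel`, the `cifar`/`audioset` heads, the uniform task-sampling schedule, the toy input dimensions) are illustrative assumptions.

```python
# Illustrative sketch of multi-task co-training with a shared encoder,
# per-modality tokenizers, and per-task heads. Names and dimensions are
# hypothetical; PolyViT's actual schedule and architecture differ in detail.
import random
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    def __init__(self, dim, tokenizers, heads):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared parameters
        self.tokenizers = nn.ModuleDict(tokenizers)  # one embedder per modality
        self.heads = nn.ModuleDict(heads)            # one classifier per task

    def forward(self, x, modality, task):
        tokens = self.tokenizers[modality](x)        # (batch, seq, dim)
        feats = self.encoder(tokens).mean(dim=1)     # pooled representation
        return self.heads[task](feats)

dim = 64
model = SharedEncoderModel(
    dim,
    tokenizers={"image": nn.Linear(32, dim), "audio": nn.Linear(16, dim)},
    heads={"cifar": nn.Linear(dim, 10), "audioset": nn.Linear(dim, 5)},
)
tasks = [("cifar", "image"), ("audioset", "audio")]
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    task, modality = random.choice(tasks)            # sample one task per step
    feat_dim = 32 if modality == "image" else 16
    x = torch.randn(8, 20, feat_dim)                 # stand-in batch
    y = torch.randint(0, model.heads[task].out_features, (8,))
    loss = nn.functional.cross_entropy(model(x, modality, task), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the heads and tokenizers are task- or modality-specific, adding a task in this setup costs one small classifier rather than a full model, which is the parameter-efficiency the abstract refers to.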