Beyond Traditional Transfer Learning: Co-finetuning for Action Localisation

Published: 01 Feb 2023, Last Modified: 13 Feb 2023
Submitted to ICLR 2023
Keywords: transformer, video, action recognition, action detection, multi-task learning, co-training, transfer learning
Abstract: Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large “upstream” datasets for classification, as such labels are easy to collect, and then finetuned on “downstream” tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning -- simultaneously training a single model on multiple “upstream” and “downstream” tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend our approach to multiple “upstream” datasets to further improve performance. In particular, co-finetuning significantly improves the performance on rare classes in our downstream task, as it has a regularising effect and enables the network to learn feature representations that transfer between different datasets. Finally, we observe that by co-finetuning with public video classification datasets, we achieve state-of-the-art results for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works which develop intricate models.
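The core idea in the abstract (jointly optimising one shared backbone on “upstream” classification and “downstream” localisation data in every training step, rather than pretraining and then finetuning sequentially) can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the MLP backbone, the per-clip heads, the random tensor datasets, and the equal loss weighting are all placeholder assumptions; the actual work uses a video transformer and a detection-style head for spatio-temporal localisation.

```python
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a large "upstream" classification dataset and a smaller
# "downstream" localisation dataset (random features and labels, for illustration only).
upstream = TensorDataset(torch.randn(1000, 512), torch.randint(0, 400, (1000,)))
downstream = TensorDataset(torch.randn(100, 512), torch.randint(0, 80, (100,)))

# Shared backbone with one head per task; the localisation head is simplified
# here to per-clip classification to keep the sketch short.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
cls_head = nn.Linear(256, 400)   # upstream action-classification head
loc_head = nn.Linear(256, 80)    # downstream head (placeholder for a detection head)

params = list(backbone.parameters()) + list(cls_head.parameters()) + list(loc_head.parameters())
opt = torch.optim.SGD(params, lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

up_loader = DataLoader(upstream, batch_size=32, shuffle=True)
down_loader = DataLoader(downstream, batch_size=32, shuffle=True)

# Co-finetuning: every optimisation step mixes a batch from each dataset, so the
# shared backbone receives gradients from the upstream and downstream losses
# jointly instead of in separate pretrain/finetune stages.
for step, ((x_up, y_up), (x_dn, y_dn)) in enumerate(
        zip(itertools.cycle(up_loader), itertools.cycle(down_loader))):
    if step >= 100:  # small fixed number of steps for the sketch
        break
    loss = criterion(cls_head(backbone(x_up)), y_up) \
         + criterion(loc_head(backbone(x_dn)), y_dn)  # equal weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the smaller downstream loader is cycled, rare downstream classes are revisited while the upstream data acts as a regulariser on the shared features, which matches the intuition given in the abstract.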
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
TL;DR: Instead of pretraining on "upstream" datasets and then finetuning on "downstream" tasks, we simultaneously train on all datasets, achieving significant performance improvements across all tasks, and particularly on rare classes.