Efficient Transfer Learning from Arbitrary Pre-Trained Models

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Transfer learning, Foundation models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce a compute efficient method for transfer learning from any type and number of pre-trained models.
Abstract: Transfer learning typically involves loading pre-trained weights as an initialization and then fine-tuning on a downstream task. As pre-trained models grow ever larger, this procedure becomes prohibitively expensive, because it forces us to re-use the pre-trained architecture for fine-tuning. It also precludes combining multiple pre-trained models that learn complementary information. Moreover, alternatives such as knowledge distillation do not reflect that we only wish to transfer the aspects of the pre-trained representation that are most relevant to the downstream task. To address these challenges, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the possibly smaller downstream model. AFT (1) enables transfer from multiple pre-trained models, even across multiple modalities, with minimal training overhead and no inference overhead; (2) selectively transfers the information in the pre-trained features that is most relevant to the downstream task, through a prior that favors low mutual information between the downstream features and inputs given the pre-trained features; (3) performs feature transfer in an efficient kernel formulation that prioritizes the most relevant degrees of freedom. Empirically, AFT delivers a substantial performance boost across diverse vision, language, and multi-modal datasets, relative to both standard transfer learning and knowledge distillation into the downstream model.
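To make the feature-transfer idea concrete, the sketch below shows one simple way such an objective could look. It is not the authors' implementation: the regression penalty, the learned linear map "proj", the weight "beta", and the placeholder feature extractors are all assumptions standing in for AFT's mutual-information prior and kernel formulation described in the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransferLoss(nn.Module):
    """Hypothetical AFT-style objective: task loss plus a penalty on the part of
    the downstream features that cannot be predicted from frozen pre-trained features."""

    def __init__(self, pre_dim: int, down_dim: int, beta: float = 1.0):
        super().__init__()
        # Learned linear map from (possibly concatenated) pre-trained features
        # to the downstream feature space; only this map and the downstream
        # model are trained, the pre-trained models stay frozen.
        self.proj = nn.Linear(pre_dim, down_dim, bias=False)
        self.beta = beta

    def forward(self, down_feats, pre_feats, logits, targets):
        task_loss = F.cross_entropy(logits, targets)
        # Stand-in for the prior: penalize downstream feature content that is
        # not explained by the pre-trained features.
        transfer_loss = F.mse_loss(down_feats, self.proj(pre_feats))
        return task_loss + self.beta * transfer_loss

# Hypothetical usage with frozen pre-trained feature extractors:
# pre_feats  = torch.cat([m(x) for m in frozen_pretrained_models], dim=-1)
# down_feats = downstream_backbone(x)
# loss = FeatureTransferLoss(pre_feats.size(-1), down_feats.size(-1))(
#     down_feats, pre_feats, head(down_feats), y)

Because only features are exchanged, the downstream architecture can be chosen freely and the pre-trained models add no cost at inference time, consistent with the claims in the abstract.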
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8472