ZEST: ZEROSHOT SPARSE FINE-TUNING

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: deep learning; surgical fine-tuning; efficient fine-tuning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recent studies have shown that fine-tuning only a subset of a model's layers can match or even outperform full fine-tuning, an approach known as surgical fine-tuning ~\citep{lee2022surgical}. This approach reduces the risk of overfitting and accelerates the fine-tuning process. However, swiftly and accurately identifying the "right" layers is not straightforward. Existing approaches naively train each candidate layer to convergence and pick the best ones, which does not scale, especially given the rapid growth in model sizes. In this paper, we propose $\textbf{ZEST}$: $\textbf{Ze}$roshot $\textbf{S}$parse fine-$\textbf{T}$uning. We first study and compare zero-shot metrics acquired from a single forward and backward pass. We observe that these metrics are inconsistent across model and dataset combinations, so we train a universal ZEST predictor to generalize the approach. We use the zero-shot ZEST predictor to rank layers by their estimated importance and fine-tune only the important parameters. By doing so, we can decrease the number of trainable parameters by up to 99\% while matching or outperforming full fine-tuning in terms of model performance. We thoroughly evaluate the effectiveness of ZEST on various tasks and modalities. We train a universal predictor for ResNet50, MobileNetV2, and EfficientNet on 8 different datasets. We also scale the method up to BERT and LLaMA. Our results demonstrate that fine-tuning just five layers can closely match or even outperform full fine-tuning on LLaMA-7B. Specifically, fine-tuning only the \textbf{5} fully connected layers of LLaMA chosen by ZEST can yield improvements of up to 5\% over full fine-tuning.
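To make the described pipeline concrete, below is a minimal PyTorch sketch of the general idea: score each layer from a single forward and backward pass, rank layers by the score, and freeze everything except the top-k layers before fine-tuning. This is not the authors' implementation; it assumes a raw gradient-norm score as the zero-shot metric, whereas the paper replaces such raw metrics with a learned universal ZEST predictor (not shown). All names here (`rank_layers_by_zeroshot_score`, `freeze_all_but`, the model and batch) are illustrative placeholders.

```python
# Hypothetical sketch of "score once, then fine-tune only top-k layers".
# Assumption: the zero-shot metric is a per-layer gradient L2 norm; the actual
# ZEST method uses a trained predictor to rank layers instead.
import torch
import torch.nn as nn


def rank_layers_by_zeroshot_score(model: nn.Module, batch, loss_fn):
    """Rank layers by the L2 norm of their gradients after one
    forward/backward pass (one possible zero-shot importance metric)."""
    model.zero_grad()
    inputs, targets = batch                      # assumes a (inputs, targets) batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                              # single backward pass

    scores = {}
    for name, module in model.named_modules():
        params = [p for p in module.parameters(recurse=False) if p.grad is not None]
        if params:                               # only layers that own parameters
            scores[name] = sum(p.grad.norm().item() for p in params)
    model.zero_grad()
    return sorted(scores, key=scores.get, reverse=True)  # most important first


def freeze_all_but(model: nn.Module, ranked_layers, k: int = 5):
    """Keep only the top-k ranked layers trainable; freeze all other parameters."""
    keep = set(ranked_layers[:k])
    for name, module in model.named_modules():
        for p in module.parameters(recurse=False):
            p.requires_grad = name in keep
```

A typical usage would be to call `rank_layers_by_zeroshot_score(model, next(iter(train_loader)), nn.CrossEntropyLoss())`, pass the ranking to `freeze_all_but(model, ranking, k=5)`, and then run an ordinary fine-tuning loop over the parameters that still require gradients.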
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3475