A Kernel-Based View of Language Model Fine-Tuning

Sadhika Malladi; Alexander Wettig; Dingli Yu; Danqi Chen; Sanjeev Arora

A Kernel-Based View of Language Model Fine-Tuning

Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora

Published: 01 Feb 2023, Last Modified: 22 Jun 2025Submitted to ICLR 2023Readers: Everyone

Keywords: language models, fine-tuning, neural tangent kernels, tensor programs

TL;DR: We show when language model fine-tuning can be in the kernel regime and derive a new kernel formula to describe Adam, and we empirically validate that this kernel can solve many downstream tasks.

Abstract: It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)--which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization--describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We also extend the NTK formalism to fine-tuning with Adam. We present extensive experiments that suggest that once the task is formulated as a masked language modeling problem through prompting, the NTK lens can often reasonably describe the model updates during fine-tuning with both SGD and Adam. This kernel view also suggests an explanation for success of parameter-efficient subspace-based fine-tuning methods. Finally, we suggest a path toward a formal explanation for our findings via Tensor Programs (Yang, 2020).

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)

Supplementary Material: zip

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/a-kernel-based-view-of-language-model-fine/code)

12 Replies

Loading