\documentclass{article}


% if you need to pass options to natbib, use, e.g.:
%     \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2024


% ready for submission
\usepackage{neurips_2024}


% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
    % \usepackage[preprint, nonatbib]{neurips_2024}
    % \usepackage[numbers]{natbib}


% to compile a camera-ready version, add the [final] option, e.g.:
%     \usepackage[final]{neurips_2024}


% to avoid loading the natbib package, add option nonatbib:
%    \usepackage[nonatbib]{neurips_2024}

\usepackage[dvipdfmx]{graphicx}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage[hypertexnames=false]{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{multicol,multirow}
\usepackage{here}
\usepackage{subcaption}
\usepackage{amsmath,amssymb, stmaryrd, amsthm}
\usepackage{bm, bbm}
\usepackage[whole]{bxcjkjatype} % Japanese
\usepackage{wrapfig} % wrapped figures and tables
\usepackage{siunitx} % SI unit
\usepackage{multicol} % two-column layout
\usepackage{cleveref}
\usepackage{autonum}
\sisetup{
  scientific-notation = true,
  exponent-product = $\times$,
  output-exponent-marker = $10^$
}


\makeatother
\theoremstyle{plain}
\newtheorem{thm}{Theorem}
\newtheorem{prop}{Proposition}
\newtheorem{lem}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{definition}{Definition}
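\setlength{\intextsep}{\baselineskip}    % vertical space above and below in-text floats
\setlength{\textfloatsep}{\baselineskip} % space between the text and floats at the top or bottom of a page
\captionsetup[figure]{
  skip=\baselineskip % spacing between a figure and its caption
}
\captionsetup[table]{
  skip=\baselineskip % spacing between a table and its caption
}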

\title{Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective}
% \title{Understanding Feature Distortion of Fine-tuning Language Models from an NTK Perspective}


% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.

\DeclareMathOperator*{\argmax}{\mathrm{arg}\,\mathrm{max}}
\DeclareMathOperator*{\argmin}{\mathrm{arg}\,\mathrm{min}}
\DeclareMathOperator*{\argsup}{\mathrm{arg}\,\mathrm{sup}}
\DeclareMathOperator*{\arginf}{\mathrm{arg}\,\mathrm{inf}}
\begin{document}
\maketitle
\begin{abstract}
  The two-stage fine-tuning (FT) method, linear probing then fine-tuning (LP-FT), consistently outperforms linear probing (LP) and FT alone in terms of accuracy for both in-distribution (ID) and out-of-distribution (OOD) data. This success is largely attributed to the preservation of pre-trained features, achieved through a near-optimal linear head obtained during LP. However, despite the widespread use of large language models, the exploration of complex architectures such as Transformers remains limited. In this paper, we analyze the training dynamics of LP-FT for classification models on the basis of the neural tangent kernel (NTK) theory. Our analysis decomposes the NTK matrix into two components, highlighting the importance of the linear head norm alongside the prediction accuracy at the start of the FT stage. We also observe a significant increase in the linear head norm during LP, stemming from training with the cross-entropy (CE) loss, which effectively minimizes feature changes. Furthermore, we find that this increased norm can adversely affect model calibration, a challenge that can be addressed by temperature scaling. Additionally, we extend our analysis with the NTK to the low-rank adaptation (LoRA) method and validate its effectiveness. Our experiments with a Transformer-based model on natural language processing tasks across multiple benchmarks confirm our theoretical analysis and demonstrate the effectiveness of LP-FT in fine-tuning language models.
  \end{abstract}
%------------------------------------------------------------------------------------------
\section{Introduction}
Fine-tuning pre-trained models for new tasks is a common practice across various fields. However, simply fine-tuning the entire model can lead to overfitting on training data, which may negatively impact generalization and out-of-distribution (OOD) performance~\citep{Li2020Rethinking, lee2023surgical}. To address this, the two-stage approach known as linear probing then fine-tuning (LP-FT)~\citep{kumar2022fine} has demonstrated high performance on both in-distribution (ID) and OOD data. Initially, linear probing (LP) optimizes only the linear head of the model, after which fine-tuning (FT) updates the entire model, including the feature extractor and the linear head. This method has been extensively analyzed and enhanced~\citep{trivedi2023a, ren2023how, ha2024domain, kirichenko2023last}.
% Fine-tuning (FT) pre-trained models for new tasks is a common practice across various fields. However, simply FT the entire model can lead to overfitting on training data, which may negatively impact generalization and out-of-distribution (OOD) performance~\citep{Li2020Rethinking, lee2023surgical}. To address this, the two-stage approach known as linear probing (LP) then FT (LP-FT)~\citep{kumar2022fine} has demonstrated high performance on both in-distribution (ID) and OOD data. Initially, LP optimizes only the linear head of the model, after which FT updates the entire model, including the feature extractor and the linear head. This method has been extensively analyzed and enhanced~\citep{trivedi2023a, ren2023how, ha2024domain, kirichenko2023last}.

The feature distortion theory, introduced by~\citet{kumar2022fine}, explains the effectiveness of LP-FT on the basis of a theoretical analysis with a two-layer linear model. This theory suggests that LP-FT minimizes changes to pre-trained features by starting FT with an already optimized linear head from LP. However, our understanding of LP-FT, particularly when applied to complex architectures such as Transformers~\citep{vaswani2017attention}, remains incomplete. Thus, it is crucial to further explore the training dynamics of LP-FT in more complex models than the two-layer linear model.

In this paper, we apply the neural tangent kernel (NTK) theory~\citep{jacot2018neural} to clarify the mechanisms underlying LP-FT, focusing on the training dynamics of classification models. The NTK is a theoretical tool that analyzes training dynamics by applying a first-order approximation to changes in the model outputs with respect to its parameters. Therefore, the NTK is suited for analyzing feature changes during FT dynamics~\citep{wei2022more,malladi2023kernel}. Our analysis reveals that after LP, both more accurate predictions and increased norms of the linear head compared to their initial values contribute to minimizing feature changes. We then identify a significant increase in the linear head norm during LP from the analysis of training with cross-entropy (CE) loss, which contributes to small feature changes in the FT stage. On the other hand, we find that this increase in the linear head norm can worsen calibration, causing predicted probabilities to deviate from actual probabilities, which can be corrected with temperature scaling~\citep{pmlr-v70-guo17a}. Furthermore, we extend our analysis based on the NTK to the low-rank adaptation (LoRA) method~\citep{hu2022lora}, a parameter-efficient fine-tuning strategy, and validate its effectiveness.

Our contributions are summarized as follows:
\vspace{-0.3\baselineskip}
{
\setlength{\leftmargini}{10pt}
\begin{itemize}
  \setlength{\itemsep}{4pt}
	\setlength{\parskip}{0pt}
	\setlength{\itemindent}{0pt}
	\setlength{\labelsep}{5pt}
    \item We show that both accurate predictions and increased norms of the linear head during LP reduce feature changes in LP-FT within the NTK regime (\Cref{sec:lp_ft_ntk}), which is consistent with the feature distortion theory (\Cref{cor:feature_distortion}).
    \item We find that the norms of the linear head significantly affect the balance of the NTK matrix components and influence the training dynamics of FT (\Cref{prop:ntk}).
    \item We also highlight that increased linear head norms can negatively affect model calibration, and this can be fixed with temperature scaling.
    \item We extend our analysis based on the NTK to the LoRA method and provide a theoretical validation of its efficacy (\Cref{prop:lora}).
\end{itemize}
}
\vspace{-0.5\baselineskip}
%------------------------------------------------------------------------------------------
\section{Related work}
\paragraph{LP-FT}
FT and LP are well-established transfer learning techniques with extensive empirical and theoretical studies~\citep{zhuang2020comprehensive, kornblith2019better, tripuraneni2020theory}.~\citet{kumar2022fine} analyzed the effectiveness of these techniques using a two-layer linear model and proposed LP-FT, which applies LP and then FT. Building on this work, subsequent studies have explored LP-FT in more detail.~\citet{trivedi2023a} investigated LP-FT through the lens of safety objectives, proposing modifications to mitigate simplicity bias.~\citet{ren2023how} analyzed LP-FT from the perspective of the initial discrepancy between predicted and actual probabilities, emphasizing the importance of the number of probing epochs during LP.~\citet{ha2024domain} further improved LP-FT by aligning batch normalization layers with the target domain.~\citet{kirichenko2023last} highlighted the challenge that models depend on spurious features and proposed last-layer retraining as a cost-effective strategy to improve model robustness.
%------------------------------------------------------------------------------------------
\paragraph{Other FT methods}
Various FT strategies other than LP-FT have been proposed, including two-stage approaches~\citep{zhang2020side}, regularization-based techniques~\citep{jiang2019smart}, and parameter-efficient fine-tuning methods~\citep{pmlr-v97-houlsby19a, he2022towards}. One prominent example of a parameter-efficient method is LoRA, proposed by~\citet{hu2022lora}. This approach draws inspiration from the concept of intrinsic dimensions~\citep{aghajanyan-etal-2021-intrinsic}, suggesting that data can be effectively represented in a lower-dimensional space.~\citet{zeng2024the} explored the expressive power of LoRA, and~\citet{jang2024lora} provided a theoretical analysis of its convergence properties. However, challenges remain in parameter-efficient FT methods, including potential instability issues identified by~\citet{chen-etal-2022-revisiting}.
%------------------------------------------------------------------------------------------
\paragraph{Neural tangent kernel (NTK)}
The NTK, which was first introduced by~\citet{jacot2018neural}, has become a valuable tool for analyzing the training dynamics of neural networks. Studies by~\citet{lee2019wide} and~\citet{arora2019exact} used the NTK to gain insights into how networks learn. Building on this foundation,~\citet{wei2022more} introduced the concept of the empirical NTK, which extends the application of NTK to FT scenarios. This approach replaces the randomly initialized parameters in the standard NTK with the parameters of the pre-trained models. Further expanding on the empirical NTK,~\citet{malladi2023kernel} conducted a theoretical and experimental investigation and found that prompt-based fine-tuning exhibits behavior consistent with the predictions of the kernel framework.~\citet{jang2024lora} extended this perspective to analyze LoRA.
%------------------------------------------------------------------------------------------
\section{Preliminary}
In this section, we provide an overview of the FT methods used in this paper, followed by a brief explanation of the NTK.
%------------------------------------------------------------------------------------------
\paragraph{LP-FT}
In standard FT, the parameters of the linear head, weight $\bm{V}$ and bias $\bm{b}$, are initialized with random values. In contrast, LP-FT performs LP before the FT stage and starts the FT stage from the linear head parameters obtained by LP. LP-FT outperforms both LP and FT on ID and OOD data~\citep{kumar2022fine}. \citet{kumar2022fine} explain this with the feature distortion theory: the success of LP-FT stems from minimal feature changes, because the FT stage starts with linear head parameters that are close to the optimal solution. We analyze the training process of LP-FT throughout this paper.
%------------------------------------------------------------------------------------------
\paragraph{LoRA}
LoRA~\citep{hu2022lora} introduces trainable rank decomposition matrices into each layer of the Transformer architecture. This approach, inspired by the concept of ``intrinsic dimensions'' from~\citet{aghajanyan-etal-2021-intrinsic}, constrains updates to pre-trained weight matrices via low-rank decomposition. The update of a pre-trained weight matrix $\bm{W_0} \in \mathbb{R}^{q\times s}$ is parameterized as $\bm{W_0} + \Delta \bm{W} = \bm{W_0} + \bm{B}^{\text{LoRA}}\bm{A}^{\text{LoRA}}$, where $\bm{B}^{\text{LoRA}}\in \mathbb{R}^{q\times r}$ and $\bm{A}^{\text{LoRA}}\in \mathbb{R}^{r\times s}$ are the only matrices optimized during fine-tuning. Here, $r \ll \min (q, s)$ is the rank of the update, reflecting the low-rank approximation. In the standard initialization, $\bm{B}^{\text{LoRA}}$ is the zero matrix $O$ and the entries of $\bm{A}^{\text{LoRA}}$ are drawn from a normal distribution with mean $0$.
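
As a rough illustration of the parameter savings, consider a hypothetical square weight matrix with $q = s = 768$ and rank $r = 8$. These values are assumptions chosen only for illustration and are not taken from the experiments in this paper. Full fine-tuning of this matrix and its LoRA counterpart update
\begin{align}
  qs = 768 \times 768 = 589{,}824
  \quad\text{and}\quad
  r(q+s) = 8 \times 1536 = 12{,}288
\end{align}
parameters, respectively, so LoRA trains about $2.1\%$ as many parameters as full fine-tuning for this matrix.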
%------------------------------------------------------------------------------------------
\paragraph{Neural tangent kernel (NTK)}
\citet{jacot2018neural} introduced the NTK, which captures the training dynamics over time. They demonstrated that in the infinite width limit, the NTK remains constant. In this limit, training dynamics are governed by a linear model derived from a first-order Taylor expansion around the initial parameters of the network, known as the linearized or NTK regime~\citep{lee2019wide}. For networks with finite width, the kernel depends on the parameters at initialization and is known as the empirical NTK~\citep{wei2022more}. Although the empirical NTK differs from the infinite width limit, it is valuable for analyzing the local training dynamics of models~\citep{ren2022better, fort2020deep, mohamadi2023a, wei2022more, jang2024lora}, and has been used in FT~\citep{ren2023how, malladi2023kernel}.
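
For reference, the linearization can be written as follows. This is the standard formulation, stated in generic notation where $\bm{f}(\cdot;\theta)$ is the model, $\theta_0$ denotes the parameters at the start of training, and $\Theta^{\bm{f}}$ denotes the resulting kernel:
\begin{align}
  \bm{f}(\bm{x};\theta) \approx \bm{f}(\bm{x};\theta_0) + \frac{\partial \bm{f}(\bm{x};\theta_0)}{\partial \theta}(\theta - \theta_0),
  \qquad
  \Theta^{\bm{f}}(\bm{x}, \bm{x}') := \frac{\partial \bm{f}(\bm{x};\theta_0)}{\partial \theta} \frac{\partial \bm{f}(\bm{x}';\theta_0)}{\partial \theta}^\top .
\end{align}
Under this approximation, a gradient step on the parameters changes the output at $\bm{x}$ only through the kernel values $\Theta^{\bm{f}}(\bm{x}, \bm{x}_i)$ evaluated at the training samples $\bm{x}_i$.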
%------------------------------------------------------------------------------------------
\section{Analysis of LP-FT from NTK perspective}
The original analysis of LP-FT by~\citet{kumar2022fine} is based on a two-layer linear model and proposes the feature distortion theory, which attributes the robust performance of LP-FT to minimal changes in pre-trained features. In this section, we use the NTK theory to better understand the training dynamics of LP-FT in complex models such as Transformers. After introducing the notation, we analyze the training dynamics in the NTK regime and then discuss the increase in the classifier weight norm during training. We then extend our analysis to the LoRA method. These analyses suggest that LP-FT reduces feature distortion through the increased norm of the classifier weight and the near-optimal prediction after LP.
\label{sec:lp_ft_ntk}
\subsection{Notation}
Let $\mathcal{X} = \{\bm{x}_1, \ldots, \bm{x}_N\} \subseteq \mathbb{R}^d$ represent the training samples, paired with labels from the set $\mathcal{Y} = \{y_1, \ldots, y_N\} \subseteq \{1, 2, \ldots, C\}$, where $d$, $C$, and $N$ denote the input space dimension, the number of classes, and the number of training samples, respectively. This results in a training dataset $\{(\bm{x}_{1}, y_{1}), \ldots, (\bm{x}_{N}, y_{N}) \mid \bm{x}_{i} \in \mathcal{X}, y_{i} \in \mathcal{Y}\}$, and we use $\bm{x} \in \mathbb{R}^d$ to denote both a training sample and a test sample. We denote the $k$-th element of the vector $\bm{a}$ as $[\bm{a}]_{k}$. We use the Euclidean norm $\|\cdot\|$ for vectors and the Frobenius norm $\|\cdot\|_F$ for matrices. $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors. $\bm{e}_k$ represents the one-hot vector for class $k$, and $\bm{I}_C$ is the identity matrix of size $C$.

The model function, denoted as $\bm{f}(\cdot;\theta): \mathcal{X} \rightarrow \mathbb{R}^C$, is parameterized by a set of parameters $\theta$ and sometimes abbreviated as $\bm{f}(\cdot)$. The model includes a linear head (classifier) consisting of a weight matrix $\bm{V}$ and bias vector $\bm{b}$, and a feature extractor denoted by $\bm{\phi}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^h$, where $h$ represents the feature dimension. The output of the model is given by $\bm{f}(\bm{x};\theta) = \bm{V}\bm{\phi}(\bm{x};\theta) + \bm{b}$. The parameters for a function $g(\cdot)$ and matrix $\bm{A}$ are denoted as $\theta^g$ and $\theta^A$, respectively. Subscripts represent iteration or epoch, so $\bm{f}_t(\cdot)$ denotes the model at time $t$.

With the loss function $\ell: \mathbb{R}^C \times \mathcal{Y} \rightarrow \mathbb{R}$, the training objective is to find the optimal parameters $\theta$ that minimize the empirical risk $L$ defined as $L(\bm{f}) :=L(\bm{f}(\cdot;\theta))= \frac{1}{N} \sum_{i=1}^{N} \ell(\bm{f}(\bm{x}_{i};\theta), y_{i})$. In this paper, we use the CE loss, $\ell(\bm{f}(\bm{x}), y) = -\log\left( [\bm{\sigma}_{\text{SM}}(\bm{f}(\bm{x}))]_{y}\right)$, where $\bm{\sigma}_{\text{SM}} : \mathbb{R}^C \rightarrow \mathbb{R}^C$ is the softmax function with its $k$-th element given by $[\bm{\sigma}_{\text{SM}}(\bm{f}(\bm{x}))]_{k} = \frac{\exp([\bm{f}(\bm{x})]_{k})}{\sum_{k'}\exp([\bm{f}(\bm{x})]_{k'})}$.
%-------------------------------------------------------------------------------------------
\subsection{Training dynamics in the NTK regime}
\label{subsec:ntk_analysis}
We use the NTK~\citep{jacot2018neural}, more specifically the empirical NTK~\citep{wei2022more, malladi2023kernel}, to analyze the training dynamics of both FT and LP-FT. The empirical NTK, defined as the NTK with the parameters at the start of training, is a valuable tool for understanding the neural network training process, particularly in the context of FT~\citep{wei2022more, malladi2023kernel, ren2023how}. The empirical NTK applies a first-order approximation to changes in model outputs with respect to its parameters, so this is expected to capture changes in features.

To investigate the feature distortion theory in FT and LP-FT, we decompose the updates into the following two parts. The part influenced by feature updates, unique to FT and absent in LP, is termed the \textit{FT-effective} component of the NTK matrix, represented as $\bm{F}(\bm{x}, \bm{x}_i)$. In contrast, the part not influenced by feature updates, common to both FT and LP and determined by the pre-trained model, is termed the \textit{pre-train-effective} component, represented as $\bm{P}(\bm{x}, \bm{x}_i)$. The following proposition uses this decomposition to highlight the distinct training dynamics of LP-FT in the NTK regime.
%------------------------------------------------------------------------------------------
\begin{prop}[FT in the NTK regime]
  \label{prop:ntk}
  The NTK of a model $\bm{f}(\bm{x})=\bm{V}\bm{\phi}(\bm{x})+\bm{b}$, denoted by $\Theta^{\bm{f}}$, can be decomposed as:
  \begin{align}
    \Theta^{\bm{f}}(\bm{x}, \bm{x}_i) = \bm{P}(\bm{x}, \bm{x}_i) + \bm{F}(\bm{x}, \bm{x}_i),
  \end{align}
  where the pre-train-effective component $\bm{P}(\bm{x}, \bm{x}_i)$ and the FT-effective component $\bm{F}(\bm{x}, \bm{x}_i)$ are defined using the classifier weight matrix $\bm{V}_0$ and the feature extractor $\bm{\phi}_0$ at the start of training as:
  \begin{align}
    \bm{P}(\bm{x}, \bm{x}_i) &:= (\langle \bm{\phi}_0(\bm{x}), \bm{\phi}_0(\bm{x}_i)\rangle + 1) \bm{I}_{C},\\
    \bm{F}(\bm{x}, \bm{x}_i) &:=  \bm{V}_0 \frac{\partial \bm{\phi}_0(\bm{x})}{\partial \theta^{\bm{\phi}}} \frac{\partial \bm{\phi}_0(\bm{x}_i)}{\partial \theta^{\bm{\phi}}}^\top \bm{V}_0^\top.
  \end{align}
  Consequently, assuming that one-epoch training within the NTK regime approximates FT, the changes in the logits and feature vectors for a sample $\bm{x}$ after FT, $\bm{f}^{\text{FT}}(\bm{x})$ and $\bm{\phi}^{\text{FT}}(\bm{x})$, relative to their values at the start of training, $\bm{f}_0(\bm{x})$ and $\bm{\phi}_0(\bm{x})$, can be expressed as:
  \begin{align}
    \bm{f}^{\text{FT}}(\bm{x}) - \bm{f}_0(\bm{x})
    &= \eta \sum_{i=1}^N \left(\bm{P}(\bm{x}, \bm{x}_i) +  \bm{F}(\bm{x}, \bm{x}_i)\right) \bm{\delta}_i, \label{eq:ntkTraining}\\
    \bm{\phi}^{\text{FT}}(\bm{x}) - \bm{\phi}_0(\bm{x})
    &= \eta \sum_{i=1}^N \Theta^{\bm{\phi}}(\bm{x}, \bm{x}_i) \bm{V}_0^\top \bm{\delta}_i, \label{eq:feature}
  \end{align}
  where $\Theta^{\bm{\phi}}$ is the NTK matrix of the feature extractor $\bm{\phi}$, $\bm{\delta}_i := \bm{e}_{y_i} - \bm{\sigma}_{\text{SM}}(\bm{f}_0(\bm{x}_i))$ represents the difference between the one-hot label for the class $y_i$ and the predicted probability, and $\eta$ is the learning rate.
\end{prop}
The proof of this proposition is included in the Appendix (\Cref{subsec:ntk}). In our decomposition of the NTK matrix, the pre-train-effective component $\bm{P}(\bm{x}, \bm{x}_i)$ is a diagonal matrix and remains unchanged after LP, while the FT-effective component $\bm{F}(\bm{x}, \bm{x}_i)$ is not a diagonal matrix and does change after LP, resulting in distinct characteristics for these components. The Frobenius norm of the classifier weight matrix, $\|\bm{V}_0\|_F$, influences the balance between the pre-train-effective and FT-effective components because it affects only the FT-effective component. This indicates that the classifier weight norm $\|\bm{V}_0\|_F$ has a significant impact on the training dynamics of FT.
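
The decomposition can be seen from a short chain-rule outline. This is only an informal sketch, and the Appendix contains the formal proof. Splitting the parameters into the classifier weight, the bias, and the feature extractor parameters, $\theta = (\theta^{\bm{V}}, \bm{b}, \theta^{\bm{\phi}})$, the NTK is a sum of three Jacobian products:
\begin{align}
  \Theta^{\bm{f}}(\bm{x}, \bm{x}_i)
  = \frac{\partial \bm{f}_0(\bm{x})}{\partial \theta^{\bm{V}}} \frac{\partial \bm{f}_0(\bm{x}_i)}{\partial \theta^{\bm{V}}}^\top
  + \frac{\partial \bm{f}_0(\bm{x})}{\partial \bm{b}} \frac{\partial \bm{f}_0(\bm{x}_i)}{\partial \bm{b}}^\top
  + \frac{\partial \bm{f}_0(\bm{x})}{\partial \theta^{\bm{\phi}}} \frac{\partial \bm{f}_0(\bm{x}_i)}{\partial \theta^{\bm{\phi}}}^\top .
\end{align}
The $c$-th logit depends on $\bm{V}$ only through its $c$-th row, with gradient $\bm{\phi}_0(\bm{x})$, so the first term equals $\langle \bm{\phi}_0(\bm{x}), \bm{\phi}_0(\bm{x}_i)\rangle \bm{I}_C$. The second term equals $\bm{I}_C$ because $\partial \bm{f}/\partial \bm{b}$ is the identity. Together they form $\bm{P}(\bm{x}, \bm{x}_i)$. For the third term, $\partial \bm{f}_0(\bm{x})/\partial \theta^{\bm{\phi}} = \bm{V}_0\, \partial \bm{\phi}_0(\bm{x})/\partial \theta^{\bm{\phi}}$, which gives $\bm{F}(\bm{x}, \bm{x}_i)$.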
%------------------------------------------------------------------------------------------
\paragraph{Hypothesis on reduced feature changes in LP-FT}
The above proposition provides insights into why LP-FT causes fewer feature changes compared to FT:
\vspace{-0.5\baselineskip}
{
\setlength{\leftmargini}{10pt}

\begin{enumerate}
	\setlength{\itemsep}{1pt}
	\setlength{\parskip}{0pt}
	\setlength{\itemindent}{0pt}
	\setlength{\labelsep}{1pt}
  \item The impact of the classifier weight norm $\|\bm{V}_0\|_F$ differs in the equations: it affects feature changes linearly~\eqref{eq:feature} and affects logits quadratically~\eqref{eq:ntkTraining}. This implies that a higher norm can result in significant logit updates with relatively minor changes to the feature extractor, reducing feature changes in LP-FT compared with FT due to the increased classifier weight norm after LP.
  \item The magnitude of changes in both features and logits (\eqref{eq:ntkTraining} and~\eqref{eq:feature}) is proportional to $\bm{\delta}_i$, the difference between the predicted probability and the one-hot label. This suggests that feature changes are less pronounced in LP-FT than in FT since the difference $\bm{\delta}_i$ is smaller after LP.
  \item The learning rate $\eta$, typically smaller in LP-FT than in FT~\citep{kumar2022fine, ren2023how, ha2024domain}, helps moderate the direct influence of large classifier weight norms.
\end{enumerate}
}
\vspace{-0.5\baselineskip}
%------------------------------------------------------------------------------------------
Prior studies~\citep{kumar2022fine, ren2023how} have suggested that reduced feature changes in LP-FT stem from the near-optimal linear head obtained during LP. However, our analysis reveals that feature changes in LP-FT are also influenced by the classifier weight norm $\|\bm{V}_0\|_F$ after LP. Our analysis focusing on classifier weight norms provides a new perspective on the training dynamics of LP-FT, highlighting the importance of the classifier weight norm in reducing feature distortion.
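
To make the first item of the hypothesis concrete, consider rescaling the classifier weight by a factor $\alpha > 0$, that is, $\bm{V}_0 \mapsto \alpha \bm{V}_0$, while holding $\eta$ and $\bm{\delta}_i$ fixed. This is a simplifying assumption made only for illustration, since in practice $\bm{\delta}_i$ also changes with $\bm{V}_0$. Writing $\bm{F}(\bm{x}, \bm{x}_i) = \bm{V}_0 \Theta^{\bm{\phi}}(\bm{x}, \bm{x}_i) \bm{V}_0^\top$, the FT-effective logit update and the feature update scale as
\begin{align}
  \eta \sum_{i=1}^N \bm{F}(\bm{x}, \bm{x}_i) \bm{\delta}_i \;\mapsto\; \alpha^2\, \eta \sum_{i=1}^N \bm{F}(\bm{x}, \bm{x}_i) \bm{\delta}_i,
  \qquad
  \eta \sum_{i=1}^N \Theta^{\bm{\phi}}(\bm{x}, \bm{x}_i) \bm{V}_0^\top \bm{\delta}_i \;\mapsto\; \alpha\, \eta \sum_{i=1}^N \Theta^{\bm{\phi}}(\bm{x}, \bm{x}_i) \bm{V}_0^\top \bm{\delta}_i .
\end{align}
For example, with $\alpha = 10$ the FT-effective logit update grows by a factor of $100$ while the feature update grows by a factor of $10$, so an FT-effective logit change of a given size is accompanied by a feature change that is $10$ times smaller.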
%------------------------------------------------------------------------------------------
\subsection{Derivation of Lemma A.3 from Kumar et al. in the NTK regime}
\label{subsec:two_layer}
The analysis presented in the original LP-FT paper by~\citet{kumar2022fine} operates within a framework where the feature extractor is a linear function. We define this framework in our context as follows:
\begin{definition}[Linear model~\citep{kumar2022fine}]
  \label{dfn:linearModel}
  A linear model is defined as $\bm{f}(\bm{x}) = \bm{V}\bm{\phi}(\bm{x}) + \bm{b}$, where $\bm{\phi}(\bm{x}) = \bm{B}\bm{x}$ denotes the feature extractor, $\bm{V} \in \mathbb{R}^{C \times h}$ is the classifier weight matrix, and $\bm{B} \in \mathbb{R}^{h \times d}$ is the weight matrix of the feature extractor.
\end{definition}
In this setting, we derive from \Cref{prop:ntk} a corollary that corresponds to the pivotal lemma in the original LP-FT analysis~\citep{kumar2022fine}:
\begin{corollary}[Lemma A.3 from Kumar et al. in the NTK regime]
 \label{cor:feature_distortion}
  Within the context of the linear model (\Cref{dfn:linearModel}), for any sample $\bm{x} \in \operatorname{Span}(\mathcal{X})^{\bot}$, the orthogonal complement of the subspace spanned by the training sample set $\mathcal{X}$, the features after FT remain unchanged, expressed as:
  \begin{align}
    \bm{\phi}^{\text{FT}}(\bm{x}) = \bm{\phi}_0(\bm{x}),
  \end{align}
  where $\bm{\phi}^{\text{FT}}(\bm{x})$ and $\bm{\phi}_0(\bm{x})$ denote the feature vectors after and before FT, respectively.
\end{corollary}
This corollary shows that feature vectors for samples in the orthogonal complement of the training sample subspace are not updated. Therefore, given that pre-trained features have characteristics beneficial to downstream tasks, significant feature changes in FT, which are driven only by a small set of training samples, can lead to poor generalization and OOD performance. The proof of this corollary can be found in the Appendix (\Cref{proof:corollary}).
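
The mechanism behind this corollary can be outlined as follows. This is only an informal sketch, and the Appendix contains the formal proof. For the linear feature extractor $\bm{\phi}(\bm{x}) = \bm{B}\bm{x}$, each feature coordinate depends only on the corresponding row of $\bm{B}$, with gradient $\bm{x}$, so the NTK of the feature extractor is $\Theta^{\bm{\phi}}(\bm{x}, \bm{x}_i) = \langle \bm{x}, \bm{x}_i \rangle \bm{I}_h$, where $\bm{I}_h$ denotes the identity matrix of size $h$. Substituting this into~\eqref{eq:feature} gives
\begin{align}
  \bm{\phi}^{\text{FT}}(\bm{x}) - \bm{\phi}_0(\bm{x})
  = \eta \sum_{i=1}^N \langle \bm{x}, \bm{x}_i \rangle \bm{V}_0^\top \bm{\delta}_i,
\end{align}
which vanishes whenever $\langle \bm{x}, \bm{x}_i \rangle = 0$ for all $i$, that is, whenever $\bm{x} \in \operatorname{Span}(\mathcal{X})^{\bot}$.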
%---------------------------------------------------------------------------------------------
\subsection{Increase in the classifier weight norm}
\label{subsec:norm_increase}
\begin{figure}[tbp]
  %notebook/00_paper/prediction_analysis.ipynb
    \centering
    \begin{minipage}[b]{0.32\linewidth}
      \centering
      \includegraphics[width=\linewidth]{images/prediction_analysis/norm_increase_LP_noreg_rte.png}
      \subcaption{LP}
  \end{minipage}
  \hfill
  \vspace{0.5\baselineskip}
  \centering
  \begin{minipage}[b]{0.32\linewidth}
    \centering
    \includegraphics[width=\linewidth]{images/prediction_analysis/norm_increase_normal_rte_ft.png}
    \subcaption{FT}
\end{minipage}
\hfill
\centering
\begin{minipage}[b]{0.32\linewidth}
  \centering
  \includegraphics[width=\linewidth]{images/prediction_analysis/norm_final_rte_weight_norms.png}
  \subcaption{After training}
\end{minipage}
%notebook/00_paper/prediction_analysis.ipynb
  \caption{Increase in classifier weight norms during training on the RTE dataset. (a) and (b) show the increase in both accuracy and classifier weight norms during training. (c) shows classifier weight norms after training.}
  \label{fig:norm_increase}
\end{figure}
The analysis in the previous section suggests that the classifier weight norm affects both feature changes and logits. On the basis of this insight, we examine classifier weight norms during training. \Cref{fig:norm_increase} shows that classifier weight norms consistently increase over time for LP, standard FT, and LoRA. As training proceeds, the norms of the classifier bias and logits increase, while the training loss decreases. Notably, LP shows a significantly larger increase in the norm compared to FT and LoRA.

Consider the transpose of the $k$-th row of matrix $\bm{V}$ denoted as $\bm{v}_{k} \in \mathbb{R}^{h}$ for $1 \leq k \leq C$, where $C$ is the number of classes. Let $\tau_{ki}$ represent the angle between $\bm{\phi}(\bm{x}_i)$ and $\bm{v}_k$, which expands $\langle \bm{v}_k, \bm{\phi}(\bm{x}_i) \rangle$ to $\|\bm{v}_k\|\|\bm{\phi}(\bm{x}_i)\|\cos\tau_{ki}$. The probability that class $k$ is chosen for sample $\bm{x}_i$ is given by the softmax function $[\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i) \right)]_{k} = \frac{\exp(\langle \bm{v}_k, \bm{\phi}(\bm{x}_i) \rangle)}{\sum_{k'}\exp(\langle \bm{v}_{k'}, \bm{\phi}(\bm{x}_i) \rangle)}$. Consequently, with the CE loss for an input $\bm{x}_i$ classified into class $y_i$ defined as $\ell(\bm{f}(\bm{x}_i), y_{i}) = -\log \left([\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i) \right)]_{y_{i}}\right)$, we have the following partial derivatives:
{\small
\begin{align}
  \frac{\partial \ell(\bm{f}(\bm{x_i}), y_{i})}{\partial \cos\tau_{ki}} &=
  \begin{cases}
      [\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x_i}) \right)]_{k}\|\bm{v}_k\|\|\bm{\phi}(\bm{x_i})\| & \text{if } k \neq y_{i},\\
      -(1-[\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x_i}) \right)]_{y_{i}})\|\bm{v}_{y_i}\|\|\bm{\phi}(\bm{x_i})\| & \text{if } k = y_{i},
  \end{cases}
  \label{eq:derivative_cos}
\end{align}}
where the derivative is negative for $k = y_i$ and positive for $k \neq y_i$. As training progresses, $\cos\tau_{y_{i}i}$ therefore tends to increase and become positive, while $\cos\tau_{ki}$ for $k \neq y_i$ tends to decrease and become negative for each $i$. The derivative with respect to $\|\bm{v}_k\|$ is given by:
{\small
\begin{align}
    \frac{\partial L(\bm{f})}{\partial \|\bm{v}_k\|} &= \frac{1}{N}\left(\sum_{i:\, y_i \neq k} [\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i) \right)]_{k}\|\bm{\phi}(\bm{x}_i)\| \cos\tau_{ki} - \sum_{i:\, y_i = k} (1-[\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i) \right)]_{y_i})\|\bm{\phi}(\bm{x}_i)\| \cos\tau_{y_{i}i}\right).
    \label{eq:derivative_norm}
\end{align}}
%-----------------------------------------------------------------------------------------
Therefore, after adequate training, when $\cos\tau_{ki} < 0$ for $k \neq y_i$ and $\cos\tau_{y_{i}i} >0$, the derivative with respect to $\|\bm{v}_k\|$ is likely to be negative for each class $k$. Since training updates the parameters in the direction that decreases the empirical risk $L$, the norm $\|\bm{v}_k\|$ tends to increase. This finding aligns with prior studies~\citep{soudry2018the, kim2020adjusting}.
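
Both derivatives above follow from the chain rule applied to the logit $[\bm{f}(\bm{x}_i)]_k = \|\bm{v}_k\|\|\bm{\phi}(\bm{x}_i)\|\cos\tau_{ki}$, where the bias is omitted as in the softmax expression above and $\|\bm{v}_k\|$ and $\cos\tau_{ki}$ are treated as independent variables. The intermediate steps, written out for a single sample, are:
\begin{align}
  \frac{\partial \ell(\bm{f}(\bm{x}_i), y_i)}{\partial [\bm{f}(\bm{x}_i)]_k} &= [\bm{\sigma}_{\text{SM}}(\bm{f}(\bm{x}_i))]_k - \mathbbm{1}[k = y_i], \\
  \frac{\partial [\bm{f}(\bm{x}_i)]_k}{\partial \cos\tau_{ki}} &= \|\bm{v}_k\|\|\bm{\phi}(\bm{x}_i)\|,
  \qquad
  \frac{\partial [\bm{f}(\bm{x}_i)]_k}{\partial \|\bm{v}_k\|} = \|\bm{\phi}(\bm{x}_i)\|\cos\tau_{ki}.
\end{align}
Multiplying the first line by each factor in the second line gives the per-sample terms of the two derivatives, and the sign of each term is determined by whether $k = y_i$ and by the sign of $\cos\tau_{ki}$.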
%------------------------------------------------------------------------------------------
\paragraph{Remark: increase in classifier weight norms is more pronounced in LP than in FT}
In FT, particularly within an overparameterized setting, the model $\bm{f}$ may achieve perfect classification on the training dataset. That is, $[\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i) \right)]_{k}$ becomes close to $0$ for $k \neq y_i$ and $1$ for $k = y_i$. In this scenario, the derivative in Eq.~\eqref{eq:derivative_norm} becomes close to zero, or the training itself is finished. Conversely, perfect classification is typically unattainable in LP unless the training dataset is linearly separable, so the derivative continues to be negative. In addition, while all parameters are updated in FT, only the classifier is optimized in LP, so the change in the classifier weight needs to be larger in LP than in FT to achieve the same classification performance. Consequently, the classifier weight norm tends to increase more significantly in LP than in FT, as shown in \Cref{fig:norm_increase} (c).
%------------------------------------------------------------------------------------------
\subsection{Training process of LoRA}
We extend our analysis based on the NTK to the training process of LoRA. We follow the linear model setting as in \Cref{dfn:linearModel} and analyze the training dynamics of LoRA in the NTK regime.
% Both the wording and the content are close to \citet{malladi2023kernel}; is this acceptable?
\begin{prop}[LoRA approximates FT]
  \label{prop:lora}
  Consider the linear model setting (\Cref{dfn:linearModel}) and let $\bm{f}^{\text{LoRA}}$ and $\bm{f}^{\text{FT}}$ be the models obtained via one-epoch training with LoRA and standard FT in the NTK regime. Let $r$ denote the LoRA rank, and let $\sigma^2$ denote the variance of the low-rank weight matrix initialization. Assume the input samples $\bm{x}$ satisfy $\|\bm{x}\| \leq c$. Then, for each sample pair $\bm{x}_i, \bm{x}_j \in \mathcal{X}$, the pre-train-effective components of the NTK matrix for LoRA and FT, $\bm{P}^{\text{LoRA}}(\bm{x}_i, \bm{x}_j)$ and $\bm{P}^{\text{FT}}(\bm{x}_i, \bm{x}_j)$, are identical:
  \begin{align}
    \bm{P}^{\text{LoRA}}(\bm{x}_i, \bm{x}_j) = \bm{P}^{\text{FT}}(\bm{x}_i, \bm{x}_j).
  \end{align}
  Moreover, with probability at least $1 - 4\exp(-(\epsilon^2 - \epsilon^3)r/4)$, their FT-effective components, $\bm{F}^{\text{LoRA}}(\bm{x}_i, \bm{x}_j)$ and $\bm{F}^{\text{FT}}(\bm{x}_i, \bm{x}_j)$, satisfy:
  \begin{align}
    \|\bm{F}^{\text{LoRA}}(\bm{x}_i, \bm{x}_j) - \sigma^2 r \bm{F}^{\text{FT}}(\bm{x}_i, \bm{x}_j)\| \leq c\epsilon \|\bm{V}_0 \bm{V}_0^\top\|.
  \end{align}
\end{prop}
This proposition suggests that, with high probability, the NTK matrices of LoRA and standard FT differ only by a scalar factor on the FT-effective component, and that this factor depends on the LoRA hyperparameters. This implies that when the hyperparameters of LoRA are set appropriately, LoRA training is similar to standard FT training. This is consistent with the analysis by~\citet{malladi2023kernel}, in which the NTK matrices of LoRA and standard FT are close to each other. Note that the proposition is also valid for LP-FT and LP-LoRA (LP then LoRA). The proof of this proposition is included in the Appendix (\Cref{subsec:lora_ft}).
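As an illustration of how the hyperparameters enter, consider the choice $\sigma^2 = 1/r$, which is an assumption made here for concreteness and not a setting prescribed by the proposition. The scalar factor $\sigma^2 r$ then equals one, and the second bound of the proposition reads
\begin{align}
  \|\bm{F}^{\text{LoRA}}(\bm{x}_i, \bm{x}_j) - \bm{F}^{\text{FT}}(\bm{x}_i, \bm{x}_j)\| \leq c\epsilon \|\bm{V}_0 \bm{V}_0^\top\|.
\end{align}
The probability guarantee depends strongly on the rank. For the hypothetical values $\epsilon = 0.5$ and $r = 256$,
\begin{align}
  1 - 4\exp\left(-\frac{(0.5^2 - 0.5^3) \cdot 256}{4}\right) = 1 - 4e^{-8} \approx 0.9987,
\end{align}
whereas for $r = 8$ and the same $\epsilon$ the expression equals $1 - 4e^{-0.25} < 0$, so the guarantee is vacuous. The bound is therefore informative only when $r$ is sufficiently large relative to $1/(\epsilon^2 - \epsilon^3)$.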
% This proposition is highly motivated by the Proposition D.2 of~\citet{malladi2023kernel}, which show that the NTK of LoRA and standard FT are close to each other.
%------------------------------------------------------------------------------------------
\subsection{Discussion}
\label{subsec:discussion}
An increased norm of the classifier weight reduces feature distortion and enhances the contribution of the FT-effective component of the NTK matrix during training. As a result, a higher classifier weight norm in LP-FT can be advantageous. However, since the increased norm is dependent on LP training, its optimality is not guaranteed. Specifically, during test time, although the increased classifier weight norm does not influence accuracy, it affects the calibration of the model. Calibration is defined as the alignment between the predicted probabilities and the actual probabilities~\citep{pmlr-v70-guo17a}. An excessively high classifier weight norm can lead to overconfident predictions, which might be detrimental in practical applications. Consequently, there is potential for refining LP-FT by adjusting the classifier weight norm to enhance calibration.

Tuning the norm of the classifier after training can be effectively equated to applying temperature scaling~\citep{pmlr-v70-guo17a} at test time. Temperature scaling adjusts the output logits with a temperature parameter $T$, thereby improving model calibration. Specifically, temperature scaling with parameter $T$, expressed as $\bm{f}(\bm{x})/T = \frac{\bm{V}}{T}\bm{\phi}(\bm{x}) + \frac{\bm{b}}{T}$, can be viewed as scaling down the classifier weight and bias by the factor $T$.
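The statement that this rescaling leaves accuracy unchanged can be checked directly. For any $T > 0$, dividing the logits by $T$ preserves their ordering, so the predicted label is unchanged while the predicted probabilities change:
\begin{align}
  \operatorname*{arg\,max}_{k}\, [\bm{f}(\bm{x})/T]_k = \operatorname*{arg\,max}_{k}\, [\bm{f}(\bm{x})]_k, \qquad
  [\bm{\sigma}_{\text{SM}}(\bm{f}(\bm{x})/T)]_k = \frac{\exp([\bm{f}(\bm{x})]_k / T)}{\sum_{k'} \exp([\bm{f}(\bm{x})]_{k'} / T)}.
\end{align}
The rescaled classifier weight has norm $\|\bm{V}/T\| = \|\bm{V}\|/T$. Hence $T > 1$ shrinks the norm and moves the predicted probabilities toward the uniform distribution, whereas $T < 1$ enlarges the norm and sharpens them.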
%------------------------------------------------------------------------------------------
\section{Numerical evaluation with transformer models}
In this section, we numerically verify the following findings of our analysis:
%------------------------------------------------------------------------------------------
\vspace{-0.6\baselineskip}
{
\setlength{\leftmargini}{10pt}
\begin{itemize}
  \setlength{\itemsep}{1pt}
	\setlength{\parskip}{0pt}
	\setlength{\itemindent}{0pt}
	\setlength{\labelsep}{5pt}
  \item The changes in features during training are smaller in LP-FT than in FT, and the norms of the classifier significantly increase during LP (\Cref{subsec:feature_change}).
  \item The FT-effective component of the NTK matrix captures the input data more effectively than the pre-train-effective component, and it is more pronounced in LP-FT than in FT (\Cref{subsec:kernel_analysis}).
  \item A large classifier weight norm reduces the feature change during training, and its negative effects on calibration can be mitigated by temperature scaling (\Cref{subsec:classifierAndTemperatureScaling}).
\end{itemize}
}
\vspace{-0.8\baselineskip}
Details on the datasets, setup, and additional results, including performance evaluations in the experimental and practical settings, are available in the Appendix (\Cref{subsec:experimentalDetails,subsection:ExperimentAppendix}).
%------------------------------------------------------------------------------------------
\subsection{Setup}
\paragraph{Datasets and models}
We used a total of $13$ classification datasets from various benchmarks: SuperGLUE~\citep{wang2019superglue}, GLUE~\citep{wang2018glue}, BOSS~\citep{yuan2023revisiting}, and PubMed $20$k RCT~\citep{dernoncourt2017pubmed}. The breakdown of the datasets is as follows: five datasets from SuperGLUE (BoolQ, CB, RTE, WiC, and WSC), three datasets from GLUE (CoLA, MRPC, and SST-2), four datasets from BOSS (Amazon, Dynasent, SemEval, and SST-5), and PubMed $20$k RCT. Following the experimental settings of studies that analyze FT dynamics from an NTK perspective~\citep{malladi2023kernel, jang2024lora} and of a study with similar settings~\citep{chen-etal-2022-revisiting}, we employed the RoBERTa-base model~\citep{liu2020roberta} as our Transformer-based model.
% The GLUE and SuperGLUE benchmarks were chosen for their widespread use in the natural language processing community, BOSS was chosen for OOD evaluation, and the PubMed $20$k RCT dataset was chosen for validation in the practical setting.
%------------------------------------------------------------------------------------------
\paragraph{Implementation and training}
We used the Transformers library~\citep{wolf-etal-2020-transformers} and AdapterHub~\citep{pfeiffer-etal-2020-adapterhub} for our implementation. Our training protocol followed the experimental setup described by~\citet{chen-etal-2022-revisiting}. Hyperparameter tuning, especially for learning rates during the FT stage of LP-FT, was conducted through a grid search based on the validation set performance. For LP, we used logistic regression with L2 regularization on pre-trained features.
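For concreteness, L2-regularized logistic regression on frozen pre-trained features can be written as the objective below. This is the standard formulation, given only as a sketch: the regularization strength $\lambda$, the symbol $\bm{\phi}_0$ for the frozen pre-trained feature extractor, and the choice to leave the bias unregularized are assumptions introduced here, not details reported in this section.
\begin{align}
  \min_{\bm{V}, \bm{b}} \; -\frac{1}{N} \sum_{i=1}^N \log\, [\bm{\sigma}_{\text{SM}}\left( \bm{V}\bm{\phi}_0(\bm{x}_i) + \bm{b} \right)]_{y_i} + \lambda \|\bm{V}\|_F^2.
\end{align}
Only $\bm{V}$ and $\bm{b}$ are optimized in this stage; the feature extractor is not updated.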
%------------------------------------------------------------------------------------------
\begin{table}[b]
  \caption{Feature (F) changes and classifier (C) norms on the CB and RTE datasets. CS, Diff, FDR, and Norm represent cosine similarity, difference norm, Fisher's discriminant ratio, and norm, respectively.}
  \centering
  \small
  {
  \fontsize{8.0pt}{6pt}\selectfont
\tabcolsep = 2pt
  \begin{tabular}{ccccccccc}
  \toprule
  \multirow{2}{*}{Method} & \multicolumn{4}{c}{CB} & \multicolumn{4}{c}{RTE} \\
    \cmidrule(r){2-5} \cmidrule(){6-9}
     & CS(F) & Diff(F) & FDR(F) & Norm(C) & CS(F) & Diff(F) & FDR(F) & Norm(C) \\
    \midrule
    Pre-trained & $0.997$ & $ - $ & $8.14 \times 10^{4}$ &   $ 9.51\times 10^{-1} $&$ 0.996 $&$ - $&$  8.59\times 10^{1}$&$ 7.76\times 10^{-1} $ \\
    LP          & $0.997$ & $ - $ & $8.14 \times 10^{4}$ & $ 2.48\times 10^{1} $ &$ 0.996 $&$ - $&$  8.59\times 10^{1}$& $ 3.10\times 10^{1} $ \\
    FT          & $0.336$ & $2.21 \times 10^{1}$ & $7.39 \times 10^{8}$ & $ 9.60\times 10^{-1} $ &$ 0.260 $&$ 2.16\times 10^{1} $&$  1.42\times 10^{4}$&$ 7.84\times 10^{-1} $ \\
    LoRA        & $0.499$ & $1.92 \times 10^{1}$ & $8.91 \times 10^{6}$ & $ 1.43\times 10^{0} $ &$ 0.759 $&$ 1.06\times 10^{1} $&$  2.97\times 10^{3}$&$ 1.21\times 10^{0} $  \\
    LP-FT       & $0.804$ & $1.20 \times 10^{1}$ & $6.47 \times 10^{6}$ & $ 2.48\times 10^{1} $ &$ 0.942 $&$ 4.70\times 10^{0} $&$  1.57\times 10^{2}$&$ 3.10\times 10^{1} $ \\
    LP-LoRA     & $0.837$ & $9.08 \times 10^{0}$ & $2.10 \times 10^{6}$ & $ 2.49\times 10^{1} $ &$ 0.924 $&$ 4.63\times 10^{0} $&$  2.06\times 10^{1}$ &$ 3.10\times 10^{1} $ \\
    \bottomrule
    \end{tabular}}
  \label{tab:feature_comparison}
\end{table}
%-------------------------------------------------------------------------------------------
%------------------------------------------------------------------------------------------
%------------------------------------------------------------------------------------------
\clearpage
\begin{multicols}{2}
  \noindent
  \begin{minipage}{\linewidth}
  \captionsetup[table]{hypcap=false}
    \centering
    \captionof{table}{Kernel statistics on the CB dataset. FN, Acc, and FT Ratio denote the Frobenius norm, kernel regression accuracy, and  contribution of the FT-effective component, respectively. Pre-train E and FT E refer to the pre-train-effective and FT-effective components of the NTK matrix.}
    \label{tab:kernel_statistics}
%   \begingroup
% \renewcommand{\arraystretch}{0.6}
\centering
  {
  \fontsize{7.5pt}{7pt}\selectfont
\tabcolsep = 2pt
\begin{tabular}{ccrrrr}
      \toprule
      Method & Kernel & Rank & FN (K) & Acc (train/test) & FT Ratio  \\
      \midrule
      - & {\tiny Pre-train E} & 18 & 51.0 & $87.11/79.17$ & - \\
      \midrule
      FT & FT E & 608 & 13.9  & $84.74/79.76$ & \multirow{2}{*}{$0.1987$} \\
      & NTK     & 210 & 64.9  & $84.74/79.76$ & \\
      \midrule
      LoRA & FT E & 500 & 0.0226   & $86.22/79.17$ & \multirow{2}{*}{$0.0004$} \\
      & NTK     & 20  & 51.0   & $92.15/84.52$ & \\
      \midrule
      LP-FT & FT E & 344 & 7250 & $100.00/86.31$ & \multirow{2}{*}{$1.0000$} \\
      & NTK     & 344 & 7280  & $100.00/86.31$ & \\
      \midrule
      LP-LoRA & FT E & 307 & 1.51 & $94.96/85.71$ & \multirow{2}{*}{$1.0137$} \\
      & NTK     & 188 & 62.6 & $95.11/85.71$ & \\
      \bottomrule
    \end{tabular}}
  % \endgroup
  \end{minipage}
  %------------------------------------------------------------------------------------------

  \noindent
  \begin{minipage}{\linewidth}
  \captionsetup[figure]{hypcap=false}
    \centering
    \includegraphics[width=\linewidth]{images/ntk/singularvalue_normalized_cb.png}
    \captionof{figure}{Singular value distribution normalized by the maximum value on the CB dataset, showing the common pre-train-effective component (Pre-train E) and the FT-effective components for each training option.}
\label{fig:ntk}
  \end{minipage}
\end{multicols}
%------------------------------------------------------------------------------------------
%------------------------------------------------------------------------------------------
\subsection{Small feature changes during LP-FT and significant norm increase during LP}
\label{subsec:feature_change}
LP-FT achieves notable performance with Transformer-based language models, outperforming standard FT in both ID and OOD settings, as detailed in the Appendix (\Cref{subsection:idExperimentAppendix,subsection:oodExperiment}). To understand the underlying reasons for these results and to validate the small feature changes suggested by our analysis (\Cref{subsec:ntk_analysis}), we analyzed changes in both the classifier and the features.

According to the statistics presented in \Cref{tab:feature_comparison}, the feature vectors of LP-FT change less from those of the pre-trained model than the feature vectors of FT do. Consequently, LP-FT maintains high cosine similarity among its features and exhibits a low Fisher's discriminant ratio (FDR)~\citep{fisher1936use}, which assesses linear separability. Conversely, the classifier norms after LP and LP-FT are substantially larger than those of the pre-trained model and those after FT, suggesting a significant increase in classifier weights during LP. A similar trend is observed in training with LoRA.
%------------------------------------------------------------------------------------------
\subsection{Kernel analysis}
\label{subsec:kernel_analysis}
We examined the overall NTK matrix and its pre-train-effective and FT-effective components to understand their properties. Kernel regression was performed on the train and test sets to evaluate the performance of each kernel matrix.
%------------------------------------------------------------------------------------------
\paragraph{Analysis of NTK matrix components and effectiveness of LP-FT}
In~\Cref{tab:kernel_statistics}, the FT-effective component of the NTK matrix for LP-FT shows a higher rank and greater kernel regression accuracy than the pre-train-effective component, and the overall NTK matrix has intermediate properties. Additionally, the FT-effective component contributes more to the overall kernel in LP-FT than in FT, as indicated by a higher FT Ratio. This ratio is calculated as the average of $\|\sum_{i=1}^N  \bm{F}(\bm{x}, \bm{x}_i) \bm{\delta}_i\| / \|\sum_{i=1}^N \left(\bm{P}(\bm{x}, \bm{x}_i) + \bm{F}(\bm{x}, \bm{x}_i)\right) \bm{\delta}_i\|$ over the train set samples, and it reflects the greater influence of the FT-effective component in LP-FT than in FT. These results suggest that the NTK matrix of LP-FT better captures input data through the increased influence of the FT-effective component.
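Note that the FT Ratio is a ratio of norms rather than a fraction of a total, so it is not bounded above by one. Writing $\bm{a} = \sum_{i=1}^N \bm{P}(\bm{x}, \bm{x}_i) \bm{\delta}_i$ and $\bm{b} = \sum_{i=1}^N \bm{F}(\bm{x}, \bm{x}_i) \bm{\delta}_i$ (shorthand introduced here for illustration), the per-sample ratio is $\|\bm{b}\| / \|\bm{a} + \bm{b}\|$, and
\begin{align}
  \|\bm{a} + \bm{b}\|^2 = \|\bm{a}\|^2 + 2\bm{a}^\top \bm{b} + \|\bm{b}\|^2 < \|\bm{b}\|^2
  \iff \bm{a}^\top \bm{b} < -\tfrac{1}{2}\|\bm{a}\|^2.
\end{align}
An average slightly above one, such as the value reported for LP-LoRA in \Cref{tab:kernel_statistics}, is therefore consistent with the pre-train-effective and FT-effective contributions partially canceling on some samples.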
%------------------------------------------------------------------------------------------
\paragraph{Similarities between LoRA and FT}
The ranks of the FT-effective components in LoRA and FT (or LP-LoRA and LP-FT) are similar, as indicated in~\Cref{tab:kernel_statistics}. Their distributions of singular values, normalized by the maximum singular value, also closely align, as shown in~\Cref{fig:ntk}. These results are consistent with the FT-effective components of the NTK matrix in FT and LoRA differing only by a scalar factor. This consistency indicates that our analysis (\Cref{subsec:ntk_analysis}), originally based on a two-layer linear model, is applicable to more complex Transformer-based models.
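The link between a scalar factor and the normalized spectra can be made explicit. If two matrices satisfy $\bm{A} = \alpha \bm{B}$ for a scalar $\alpha \neq 0$ (generic notation used here for illustration), their singular values $s_1 \geq s_2 \geq \cdots$ satisfy
\begin{align}
  s_i(\bm{A}) = |\alpha|\, s_i(\bm{B}), \qquad \frac{s_i(\bm{A})}{s_1(\bm{A})} = \frac{s_i(\bm{B})}{s_1(\bm{B})},
\end{align}
and the two matrices have the same rank. Matching ranks and matching normalized singular value distributions are thus necessary consequences of the scalar-factor relation in the LoRA proposition, where $\alpha = \sigma^2 r$, although they do not by themselves establish that relation.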
%------------------------------------------------------------------------------------------
\subsection{Analysis of classifier weight norms and temperature scaling}
\label{subsec:classifierAndTemperatureScaling}
We experimentally verified the significant effects of classifier weight norms during training (\Cref{subsec:ntk_analysis}) and at test time (\Cref{subsec:discussion}).
%------------------------------------------------------------------------------------------
%------------------------------------------------------------------------------------------
\clearpage
\begin{multicols}{2}
  \noindent
  \begin{minipage}{\linewidth}
  \captionsetup[figure]{hypcap=false}
    \centering
    \includegraphics[width=\linewidth]{images/scaling/sst5_ood_norm_scale_feature}
    \captionof{figure}{Difference of features on SST-5 (OOD). The dashed vertical lines indicate the original classifier weight norm after training.}
    \label{fig:classifier_weight}
  \end{minipage}

  \noindent
  \begin{minipage}{\linewidth}
  \captionsetup[table]{hypcap=false}
    \centering
    \captionof{table}{ECE and MCE with temperature scaling on the test set of the RTE dataset. w/o TS and w/ TS denote without and with temperature scaling, respectively, and Imp. represents the improvement due to temperature scaling. The best improvements are shown in bold.}
    \label{tab:temperature_scaling}
    \centering
    \small
  {
\tabcolsep = 3pt
\begin{tabular}{ccrrr}
  \toprule
  Metric & Method & w/o TS & w/ TS & Imp. \\
  \midrule
 \multirow{4}{*}{ECE (\%)}
         & FT & $21.16$ & $5.13$ & $16.03$ \\
       & LP-FT & $21.72 $ & $5.48 $ & $\bm{16.24}$ \\
       & LoRA & $11.92 $ & $6.17 $ & $5.76$ \\
       & LP-LoRA & $18.14 $ & $5.72$ & $12.42$ \\
  \midrule
 \multirow{4}{*}{MCE (\%)}
         & FT & $53.11 $ & $25.87$ & $27.24$ \\
       & LP-FT & $63.95$ & $13.94$ & $\bm{50.01}$ \\
       & LoRA & $25.04$ & $13.75$ & $11.29$ \\
       & LP-LoRA & $40.46 $ & $18.82$ & $21.63$ \\
  \bottomrule
\end{tabular}}
  \end{minipage}
  %------------------------------------------------------------------------------------------
\end{multicols}
%------------------------------------------------------------------------------------------
%------------------------------------------------------------------------------------------
\paragraph{Effects of classifier weight norms in training}
We scaled the classifier weight norms at the start of standard FT and the FT stage of LP-FT. The results, shown in \Cref{fig:classifier_weight}, indicate that larger classifier norms almost monotonically lead to smaller feature differences in both FT and LP-FT. Notably, LP-FT consistently shows smaller feature differences than FT, particularly when the classifier weight norm is large, validating our analysis that larger classifier weight norms reduce feature changes.
%------------------------------------------------------------------------------------------
\paragraph{Temperature scaling at test time}
We implemented temperature scaling at test time, which is equivalent to adjusting the classifier weight norms as discussed in~\Cref{subsec:discussion}. We optimized the temperature parameter on the validation sets based on CE loss, following the methodology suggested by \citet{pmlr-v70-guo17a}. \Cref{tab:temperature_scaling} presents the results on the RTE dataset. We assessed the expected calibration error (ECE) and maximum calibration error (MCE)~\citep{naeini2015obtaining}, which quantify the absolute differences between predicted and actual probabilities, with lower values indicating better calibration. These results show that the improvements in calibration from temperature scaling are the largest in LP-FT for both ECE and MCE, with notably substantial improvements in MCE. This suggests that large classifier weight norms contribute to substantial deviations in predictions in LP-FT, which can be effectively mitigated through temperature scaling. These results highlight the effectiveness of refining LP-FT by temperature scaling.
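For reference, the quantities used above can be written as follows. These are the standard binned definitions; the number of bins $M$ and the bin notation $B_m$ are assumptions introduced here, since the binning is not specified in this section. With $n$ test samples partitioned into confidence bins $B_1, \ldots, B_M$, and with $\operatorname{acc}(B_m)$ and $\operatorname{conf}(B_m)$ denoting the average accuracy and average predicted confidence within bin $B_m$,
\begin{align}
  \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \operatorname{acc}(B_m) - \operatorname{conf}(B_m) \right|, \qquad
  \text{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \operatorname{acc}(B_m) - \operatorname{conf}(B_m) \right|.
\end{align}
The temperature is selected by minimizing the CE loss over a validation set of size $N_{\text{val}}$ (notation assumed here), with the trained model held fixed:
\begin{align}
  T^{*} = \operatorname*{arg\,min}_{T > 0} \; -\frac{1}{N_{\text{val}}} \sum_{i=1}^{N_{\text{val}}} \log\, [\bm{\sigma}_{\text{SM}}\left( \bm{f}(\bm{x}_i)/T \right)]_{y_i}.
\end{align}
Because MCE is a maximum over bins, it is determined by the single worst-calibrated bin, which is one reason it can respond more strongly than ECE to a change in $T$.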
%------------------------------------------------------------------------------------------
\section{Conclusion}
In this paper, we explored the LP-FT training dynamics in complex classification models using the NTK to analyze feature changes. Our analysis identified classifier weight norms at the start of the FT stage as a key factor influencing FT dynamics. These norms balance the NTK matrix components and help reduce feature changes. Our findings support existing feature distortion theories from an NTK perspective and emphasize the role of classifier weight norms alongside prediction accuracy. We also found that an increase in classifier weight norms, characteristic of training with CE loss, may negatively impact model calibration. However, this can be mitigated by temperature scaling. Additionally, the approximation effectiveness of LoRA is theoretically validated in terms of the similarity of the NTK matrix components. Empirical experiments with Transformer-based language models supported our theoretical insights, validating our understanding of the NTK, feature changes, and the benefits of temperature scaling. Overall, our study substantiates the efficacy of LP-FT as a robust method for adapting pre-trained complex models while preserving their well-trained features.
\paragraph{Limitations}
The main limitation of our study is that it is based on the NTK regime, which might not fully capture the training dynamics. Additionally, our analysis is based on just one epoch of gradient descent in FT, which may not effectively represent the overall training. In our experiments, we specifically focused on validating the effectiveness of LP-FT on language models. Therefore, areas other than natural language processing are outside the scope of this study.
%----------------------------------------------------------------------------------------
% \clearpage
\bibliographystyle{plainnat}
\bibliography{myref}
\section{Appendix}
\input{appendix.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage
\section*{NeurIPS Paper Checklist}

\begin{enumerate}

\item {\bf Claims}
    \item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We claim that our paper analyzes a fine-tuning method, specifically linear probing then fine-tuning, from a neural tangent kernel perspective. The abstract succinctly summarizes the main contributions, and the introduction provides a thorough overview of the paper's scope with our motivation.

    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the abstract and introduction do not include the claims made in the paper.
        \item The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
        \item The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
        \item It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
    \end{itemize}

\item {\bf Limitations}
    \item[] Question: Does the paper discuss the limitations of the work performed by the authors?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We explicitly discuss the limitations of our theoretical analysis in the limitations section of our paper, highlighting the need for further investigations.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
        \item The authors are encouraged to create a separate "Limitations" section in their paper.
        \item The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
        \item The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
        \item The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
        \item The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
        \item If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
        \item While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
    \end{itemize}

\item {\bf Theory Assumptions and Proofs}
    \item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We clearly state our assumptions alongside the propositions and provide complete proofs in the appendix. This ensures that our theoretical results are well-supported and verifiable.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not include theoretical results.
        \item All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
        \item All assumptions should be clearly stated or referenced in the statement of any theorems.
        \item The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
        \item Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
        \item Theorems and Lemmas that the proof relies upon should be properly referenced.
    \end{itemize}

    \item {\bf Experimental Result Reproducibility}
    \item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We include all essential details needed to replicate our main experimental results within the paper. This includes hyperparameters and data splits to ensure that our findings are reproducible.

    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not include experiments.
        \item If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
        \item If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
        \item Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
        \item While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
        \begin{enumerate}
            \item If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
            \item If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
            \item If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
            \item We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
        \end{enumerate}
    \end{itemize}


\item {\bf Open access to data and code}
    \item[] Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We will release the code and associated data before the review process begins. This release will include comprehensive instructions to ensure faithful reproduction of our experimental results.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that paper does not include experiments requiring code.
        \item Please see the NeurIPS code and data submission guidelines (\url{https://nips.cc/public/guides/CodeSubmissionPolicy}) for more details.
        \item While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
        \item The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (\url{https://nips.cc/public/guides/CodeSubmissionPolicy}) for more details.
        \item The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
        \item The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
        \item At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
        \item Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
    \end{itemize}


\item {\bf Experimental Setting/Details}
    \item[] Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We detail all necessary training parameters, including data splits and hyperparameters, to ensure that our experimental results can be faithfully reproduced.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not include experiments.
        \item The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
        \item The full details can be provided either with the code, in appendix, or as supplemental material.
    \end{itemize}

\item {\bf Experiment Statistical Significance}
    \item[] Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: We include error bars and standard deviations in our results where applicable, ensuring that the statistical significance of our findings is clear and well-documented.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not include experiments.
        \item The authors should answer ``Yes'' if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
        \item The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
        \item The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
        \item The assumptions made should be given (e.g., Normally distributed errors).
        \item It should be clear whether the error bar is the standard deviation or the standard error of the mean.
        \item It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96\% CI, if the hypothesis of Normality of errors is not verified.
        \item For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
        \item If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
    \end{itemize}

\item {\bf Experiments Compute Resources}
    \item[] Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Detailed descriptions of the computational resources used, including hardware specifics and implementation details, are provided in the Appendix to aid in reproducing our experiments.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not include experiments.
        \item The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
        \item The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
        \item The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
    \end{itemize}

\item {\bf Code Of Ethics}
    \item[] Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics \url{https://neurips.cc/public/EthicsGuidelines}?
    \item[] Answer: \answerYes{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: After thoroughly reviewing the NeurIPS Code of Ethics, we confirm that our research adheres to all the specified guidelines.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
        \item If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
        \item The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
    \end{itemize}


\item {\bf Broader Impacts}
    \item[] Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Given the theoretical nature of our work, we assess that it does not directly engage with societal impacts.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that there is no societal impact of the work performed.
        \item If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
        \item Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
        \item The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
        \item The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
        \item If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
    \end{itemize}

\item {\bf Safeguards}
    \item[] Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Our research does not involve the release of data or models that pose high risks for misuse, hence specific safeguards are not required.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper poses no such risks.
        \item Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
        \item Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
        \item We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
    \end{itemize}

\item {\bf Licenses for existing assets}
    \item[] Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Our study does not use any external assets, thus no licensing or attribution issues are applicable.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not use existing assets.
        \item The authors should cite the original paper that produced the code package or dataset.
        \item The authors should state which version of the asset is used and, if possible, include a URL.
        \item The name of the license (e.g., CC-BY 4.0) should be included for each asset.
        \item For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
        \item If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, \url{paperswithcode.com/datasets} has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
        \item For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
        \item If this information is not available online, the authors are encouraged to reach out to the asset's creators.
    \end{itemize}

\item {\bf New Assets}
    \item[] Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: No new assets are introduced in our paper, so there are no associated documentation requirements.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not release new assets.
        \item Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
        \item The paper should discuss whether and how consent was obtained from people whose asset is used.
        \item At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
    \end{itemize}

\item {\bf Crowdsourcing and Research with Human Subjects}
    \item[] Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Our paper does not involve crowdsourcing or research with human subjects.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
        \item Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
        \item According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
    \end{itemize}

\item {\bf Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects}
    \item[] Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    \item[] Answer: \answerNA{} % Replace by \answerYes{}, \answerNo{}, or \answerNA{}.
    \item[] Justification: Our paper does not involve crowdsourcing or research with human subjects.
    \item[] Guidelines:
    \begin{itemize}
        \item The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
        \item Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
        \item We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
        \item For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
    \end{itemize}

\end{enumerate}
\end{document}
