In Search of the $\textit{Successful}$ Interpolation: On the Role of $\textit{Sharpness}$ in CLIP Generalization
Keywords: Sharpness, Interpolation, Robust Fine-tuning, Interpretable AI, Foundation Model, CLIP
Abstract: $\textit{Zero-shot}$ models like CLIP are often fine-tuned on a target dataset to further improve their accuracy, but this can compromise out-of-distribution (OOD) robustness. Robust Fine-Tuning ($\texttt{RFT}$), which interpolates between the $\textit{zero-shot}$ and $\textit{fine-tuned}$ models, has been proposed to address this issue. However, understanding of when $\texttt{RFT}$ actually improves OOD error remains limited. In this work, we empirically investigate the robustness of $\texttt{RFT}$ in CLIP models, with a focus on the $\textit{sharpness}$ of the CLIP model during interpolation. First, we demonstrate that sharpness may not serve as a reliable indicator for predicting the generalization of modern architectures like CLIP on OOD data, which challenges the conventional belief in the generalization benefits of flat minima in foundation models. However, by examining the role of the $\textit{straggler layer}$ phenomenon, we show that, unlike overall sharpness, the $\textit{layer-wise}$ sharpness of $\textit{straggler}$ layers can reliably capture the generalization performance of interpolated CLIP models on OOD data.
Our extensive experiments reveal that $\textit{layer-wise}$ sharpness correlates with OOD generalization for $\texttt{RFT}$. Furthermore, we demonstrate that by inducing sparsity in the $\textit{straggler}$ layers, we can mitigate the $\textit{failure mode}$ phenomenon in $\texttt{RFT}$. To the best of our knowledge, this is the first work to study the role of sharpness in the $\textit{success}$ of interpolation in the weight space of CLIP foundation models.
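To make the weight-space interpolation behind $\texttt{RFT}$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear interpolation with a single mixing coefficient `alpha` applied uniformly to every layer; the abstract does not specify the exact scheme. The `Weights` type, the function name `interpolate_weights`, and the toy layer names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flattened parameter values.
Weights = Dict[str, List[float]]


def interpolate_weights(zero_shot: Weights, fine_tuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, layer by layer.

    Assumption: linear interpolation with one shared coefficient for all layers.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same layers")
    return {
        name: [
            (1.0 - alpha) * z + alpha * f
            for z, f in zip(zero_shot[name], fine_tuned[name])
        ]
        for name in zero_shot
    }


# Toy example with two made-up layers (values are illustrative only).
zs = {"layer0": [0.0, 2.0], "layer1": [1.0]}
ft = {"layer0": [1.0, 4.0], "layer1": [3.0]}
mid = interpolate_weights(zs, ft, 0.5)
```

With `alpha = 0` the sketch returns the zero-shot weights and with `alpha = 1` the fine-tuned weights. Intermediate values trace the interpolation path along which the abstract studies sharpness.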
Submission Number: 30