Keywords: Foundation Models, Vision Language Model, Robustness, Mode connectivity, OOD, Interpolation
TL;DR: RFT, i.e., linear weight interpolation, does not always improve OOD accuracy over the zero-shot model. We address this by analyzing the relationship between the geometry of the linear interpolation path and CLIP's OOD generalization.
Abstract: $\textit{Zero-shot}$ models like CLIP are often fine-tuned on a target dataset to further improve accuracy on that dataset, but this can compromise out-of-distribution (OOD) robustness. Robust Fine-Tuning ($\texttt{RFT}$), which interpolates between the weights of the $\textit{zero-shot}$ and fine-tuned models, has been proposed to address this issue. However, our understanding of when $\texttt{RFT}$ actually reduces OOD error remains limited. In this work, we empirically investigate the robustness of $\texttt{RFT}$ in CLIP models, focusing on two key factors: 1) the $\textit{presence}$ or $\textit{absence}$ of barriers along the interpolation path between the zero-shot and fine-tuned models, and 2) fine-tuning choices such as data augmentation and learning-rate magnitude. Our extensive experiments reveal that the $\textit{absence}$ of barriers correlates with larger gains in OOD accuracy for $\texttt{RFT}$. Additionally, we show that fine-tuning without data augmentation and with smaller learning rates consistently results in lower OOD error. While similar findings have been reported for CNN models, this is, to the best of our knowledge, the first work to study these properties for CLIP models.
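To make the setup concrete, below is a minimal sketch (not the authors' code) of the two ingredients described in the abstract: linear weight interpolation between a zero-shot and a fine-tuned model ($\texttt{RFT}$ in the WiSE-FT style), and a scan along that path to check for a barrier, i.e., a point where accuracy drops below both endpoints. The helper name `evaluate_ood` and the number of interpolation points are illustrative assumptions.

```python
import copy


def interpolate_state_dicts(sd_zeroshot, sd_finetuned, alpha):
    """Linear interpolation of weights: (1 - alpha) * zero-shot + alpha * fine-tuned.
    Non-floating-point buffers (e.g., integer counters) are copied from the zero-shot model."""
    return {
        k: (1.0 - alpha) * v + alpha * sd_finetuned[k] if v.is_floating_point() else v
        for k, v in sd_zeroshot.items()
    }


def scan_interpolation_path(zeroshot_model, finetuned_model, evaluate_ood, num_points=11):
    """Evaluate OOD accuracy at evenly spaced points on the linear path.
    A dip below both endpoints indicates a barrier (no linear mode connectivity)."""
    sd_zs = zeroshot_model.state_dict()
    sd_ft = finetuned_model.state_dict()
    probe = copy.deepcopy(zeroshot_model)  # reusable container for interpolated weights
    accuracies = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        probe.load_state_dict(interpolate_state_dicts(sd_zs, sd_ft, alpha))
        accuracies.append(evaluate_ood(probe))
    has_barrier = min(accuracies) < min(accuracies[0], accuracies[-1])
    return accuracies, has_barrier
```

Here `evaluate_ood` is assumed to be any callable that runs the given model on an OOD evaluation set and returns accuracy; the paper's claim is that when `has_barrier` is `False`, interpolated models tend to gain more OOD accuracy over the zero-shot endpoint.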
Submission Number: 95