Quantified Task Misalignment to Inform PEFT: An Exploration of Domain Generalization and Catastrophic Forgetting in CLIP
Keywords: Multi-modal models, foundation models, PEFT, embedding alignment, CLIP, catastrophic forgetting, generalization
Abstract: Foundation models are presented as generalists that perform well across a myriad of tasks. Fine-tuning these models, even on limited data, provides an additional boost in task-specific performance, but often at the cost of their wider generalization, an effect termed catastrophic forgetting. In this paper, we analyze the relation between zero-shot text and image embedding alignment in the CLIP model and the performance of several simple parameter-efficient fine-tuning methods through the lens of domain generalization and catastrophic forgetting. We provide evidence that the silhouette score of the zero-shot image and text embeddings is a better measure of the improvement gained from fine-tuning than the average cosine similarity of correct image/label embeddings, and discuss empirical relationships between zero-shot embedding alignment, fine-tuning method, domain generalization, and catastrophic forgetting. Additionally, the averaged results across tasks and performance measures demonstrate that a simplified method that trains only a subset of attention weights, which we call A-CLIP, provides a good balance between domain generalization and catastrophic forgetting.
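The two alignment measures compared in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for CLIP's zero-shot image and text embeddings (in practice these would come from the CLIP encoders, L2-normalized), and the noise level and dimensions are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for zero-shot CLIP embeddings: one text embedding
# per class label, several image embeddings per class.
n_classes, n_per_class, dim = 5, 20, 64
text_emb = rng.normal(size=(n_classes, dim))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Image embeddings scattered around their class's text embedding.
image_emb = np.repeat(text_emb, n_per_class, axis=0)
image_emb += 0.3 * rng.normal(size=(n_classes * n_per_class, dim))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
labels = np.repeat(np.arange(n_classes), n_per_class)

# Measure 1: silhouette score of the image embeddings under their
# ground-truth labels (cosine distance) -- the measure the abstract
# argues better predicts fine-tuning gains.
sil = silhouette_score(image_emb, labels, metric="cosine")

# Measure 2: average cosine similarity between each image embedding and
# the text embedding of its correct label.
avg_cos = float(np.mean(np.sum(image_emb * text_emb[labels], axis=1)))

print(f"silhouette (cosine): {sil:.3f}")
print(f"avg correct image/label cosine: {avg_cos:.3f}")
```

Both measures are computed purely from zero-shot embeddings, so either could in principle be used before fine-tuning to anticipate how much a given task stands to gain.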
Submission Number: 27