Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
TL;DR: Forgetting-free fine-tuning of vision foundation models via Proxy-based Feature Distribution Alignment (Proxy-FDA)
Abstract: Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to *concept forgetting* on other tasks. Recent methods for robust fine-tuning aim to mitigate such forgetting of prior knowledge without hurting fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too restrictive: it has no explicit awareness of the feature neighborhood structures, which also encode rich knowledge. We propose a novel regularization method, **Proxy-FDA**, that explicitly preserves structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and that forgetting correlates more strongly with a distributional distance metric than with point-wise L2 distance. We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot, and continual tuning) and across tasks such as image classification, captioning, and VQA.
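To make the feature distribution alignment idea concrete, below is a minimal PyTorch sketch of a nearest-neighbor-graph alignment loss, not the authors' exact Proxy-FDA objective. The choice of `k`, cosine similarity, the temperature `tau`, and the KL objective are illustrative assumptions, and the proxy-generation component of the method is omitted here.

```python
# A minimal sketch (not the paper's exact loss) of feature distribution
# alignment via nearest-neighbor graphs. Assumes PyTorch; k, cosine
# similarity, tau, and the KL objective are illustrative choices.
import torch
import torch.nn.functional as F

def fda_loss(feat_pre, feat_ft, k=5, tau=0.1):
    """Align the kNN similarity structure of fine-tuned features
    to that of the frozen pre-trained features.

    feat_pre: (B, D) features from the frozen pre-trained encoder
    feat_ft:  (B, D) features from the model being fine-tuned
    """
    zp = F.normalize(feat_pre, dim=-1)
    zf = F.normalize(feat_ft, dim=-1)

    sim_pre = zp @ zp.t()   # (B, B) pairwise cosine similarities
    sim_ft = zf @ zf.t()

    # Mask self-similarity so each sample only attends to its neighbors.
    eye = torch.eye(sim_pre.size(0), dtype=torch.bool, device=sim_pre.device)
    sim_pre = sim_pre.masked_fill(eye, float('-inf'))
    sim_ft = sim_ft.masked_fill(eye, float('-inf'))

    # The kNN graph built in the pre-trained space defines which edges
    # to align; the same edges are read out of the fine-tuned space.
    _, nn_idx = sim_pre.topk(k, dim=-1)
    p = F.softmax(sim_pre.gather(-1, nn_idx) / tau, dim=-1)      # target
    q = F.log_softmax(sim_ft.gather(-1, nn_idx) / tau, dim=-1)   # current

    return F.kl_div(q, p, reduction='batchmean')
```

This regularizer would be added to the downstream task loss during fine-tuning. The key design point it illustrates is that only the neighborhood structure (relative similarities over kNN edges) is matched, rather than forcing each individual feature to stay at its pre-trained location.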
Lay Summary: Large AI models for vision tasks learn general knowledge about real-world concepts by training on massive image datasets. When these models are later "fine-tuned" for a specific task (like classifying birds), they often "forget" knowledge they previously learned, which makes them worse at other tasks, a problem known as concept forgetting. Existing approaches try to prevent this by forcing the model's internal feature representations to stay close to the original ones on individual images, which can be overly restrictive. This paper instead preserves the overall structure of image features, that is, how different pieces of knowledge relate to each other in the feature space. Our new method does so by aligning the overall feature distributions before and after fine-tuning, and it dynamically synthesizes "virtual" features to add variety during this alignment. Our approach significantly reduces concept forgetting in many settings, including fine-tuning with abundant data, with just a few examples, or over time across multiple tasks. It also performs well across various tasks, like recognizing images, generating image captions, and answering visual questions.
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: Proxy-FDA, robust fine-tuning, concept forgetting, vision foundation model
Submission Number: 8248