POMP: A Theoretical Approach to Mitigate Forgetting in Finetuning Multi-Modal Models

06 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Multi-Modal Models, Robust Fine-tuning, Out-of-Distribution Generalization
Abstract: Catastrophic forgetting is a major challenge when adapting pretrained models to new tasks in multi-modal contrastive learning (MMCL). We provide a theoretical analysis of finetuning by introducing a *contrastive target matrix* that reformulates the linearized contrastive objective as a matrix least-squares problem. This formulation yields closed-form solutions for direct finetuning, weight-space regularization, and self-distillation, providing a geometric interpretation of how each strategy manages pretrained knowledge. Our analysis reveals that self-distillation preserves knowledge in the subspace orthogonal to the finetuning data while forming a convex combination of the pretrained and new solutions within the task subspace. We extend this analysis to a dynamic self-distillation framework with a weighted moving average (WMA) teacher. We prove that, unlike standard exponential moving average (EMA) teachers, which eventually collapse onto the student, the WMA teacher maintains a persistent, non-vanishing regularizing force throughout training by integrating the full optimization trajectory. These theoretical insights motivate our method, **POMP** (Preserve-Orthogonal-Mix-Parallel), which operationalizes this framework. POMP uses a composite distillation loss guided by the WMA teacher to achieve state-of-the-art out-of-distribution robustness and calibration when finetuning CLIP.
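The geometric claim about self-distillation can be illustrated with a simplified linear least-squares stand-in. This is only a sketch under assumed notation, not the paper's contrastive target matrix formulation. The symbols $X$ (finetuning features, assumed to have full row rank with fewer samples than dimensions), $Y$ (targets), $W_0$ (pretrained weights), $\lambda \ge 0$ (distillation weight), and $P$ (projector onto the row space of $X$) are all hypothetical choices made for this example. The "closest to $W_0$" solution is the one gradient descent started at $W_0$ would reach in this linear setting.

```latex
% Assumed toy setup (not the paper's notation):
%   X: n x d finetuning features, full row rank, n < d
%   Y: targets, W_0: pretrained weights, \lambda >= 0: distillation weight
%   X^{+}: Moore-Penrose pseudoinverse, P = X^{+} X: projector onto row space of X
\begin{align*}
W_{\mathrm{ft}} &= W_0 + X^{+}\,(Y - X W_0)
  \qquad \text{(direct finetuning: interpolant closest to } W_0\text{)} \\[4pt]
W_{\mathrm{sd}} &= \text{minimizer of }
  \|XW - Y\|_F^2 + \lambda\,\|XW - XW_0\|_F^2
  \text{ closest to } W_0 \\
  &= W_0 + \tfrac{1}{1+\lambda}\, X^{+}\,(Y - X W_0) \\[4pt]
(I - P)\, W_{\mathrm{sd}} &= (I - P)\, W_0
  \qquad \text{(orthogonal subspace: pretrained weights preserved)} \\
P\, W_{\mathrm{sd}} &= \tfrac{\lambda}{1+\lambda}\, P\, W_0
  + \tfrac{1}{1+\lambda}\, P\, W_{\mathrm{ft}}
  \qquad \text{(task subspace: convex combination)}
\end{align*}
```

In this toy setting, the third line follows because the optimal outputs are $XW = (Y + \lambda X W_0)/(1+\lambda)$, and the last two lines follow from $(I - P)X^{+} = 0$. Setting $\lambda = 0$ recovers direct finetuning, while $\lambda \to \infty$ recovers the pretrained weights.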
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 2618