On-Policy Adaptation Mitigates Hyperparameter-Sensitive Forgetting in Vision-Language Models

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: CATS@ICML26 Poster
License: CC BY 4.0
Keywords: Catastrophic Forgetting, Vision-Language Models, Reinforcement Learning, GRPO, Supervised Fine-Tuning, Vision Encoder, Continual Adaptation
TL;DR: On-policy GRPO mitigates hyperparameter-sensitive forgetting in VLM fine-tuning and better preserves vision-encoder representations than SFT.
Abstract: Repeated fine-tuning of Vision-Language Models (VLMs) risks catastrophic forgetting of prior capabilities. Recent studies suggest that on-policy RL methods such as GRPO mitigate this risk better than Supervised Fine-Tuning (SFT), but a systematic characterization in VLMs is lacking. We compare SFT and GRPO by fine-tuning Qwen2.5-VL-Instruct on CIFAR-10 and CIFAR-100, varying learning rates and training durations to characterize the key drivers of forgetting. GRPO consistently exhibits minimal forgetting, while SFT forgetting is highly sensitive to hyperparameters: high learning rates improve in-domain accuracy but induce catastrophic cross-domain forgetting. To probe whether forgetting reflects encoder-level representational damage, we fine-tune Qwen2-VL on a spatial reasoning task and isolate the vision encoders for retraining. The SFT encoder underperforms the pre-trained encoder even after retraining, which suggests encoder changes that are not fully recovered under our protocol, while the GRPO encoder retains comparable performance.
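Note on GRPO: as background for the SFT versus GRPO comparison, GRPO samples a group of responses from the current policy for each prompt and scores each response relative to the others in its group. The sketch below illustrates that group-relative normalization only. It is not taken from this submission; the function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for a prompt.

    Hypothetical helper for illustration: each advantage is the reward's
    deviation from the group mean, scaled by the group standard deviation.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one classification prompt, reward 1 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within each group, a group in which every response earns the same reward contributes no learning signal under this sketch.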
Submission Number: 22