Keywords: Catastrophic Forgetting, Vision-Language Models, Reinforcement Learning, GRPO, Supervised Fine-Tuning, Vision Encoder, Continual Adaptation
TL;DR: On-policy GRPO mitigates hyperparameter-sensitive forgetting in VLM fine-tuning and better preserves vision-encoder representations than SFT.
Abstract: Repeated fine-tuning of Vision-Language Models (VLMs) risks catastrophic forgetting of prior capabilities.
Recent studies suggest on-policy RL methods such as GRPO mitigate this risk better than Supervised Fine-Tuning (SFT), but a systematic characterization in VLMs is lacking.
We compare SFT and GRPO by fine-tuning Qwen2.5-VL-Instruct on CIFAR-10 and CIFAR-100, varying learning rates and training durations to characterize the key drivers of forgetting.
GRPO consistently exhibits minimal forgetting, while SFT forgetting is highly sensitive to hyperparameters: high learning rates improve in-domain accuracy but induce catastrophic cross-domain forgetting.
To probe whether forgetting reflects encoder-level representational damage, we fine-tune Qwen2-VL on a spatial reasoning task and isolate the vision encoders for retraining.
The SFT encoder underperforms the pre-trained encoder even after retraining, which suggests that SFT alters the encoder in ways our retraining protocol does not fully recover, while the GRPO encoder retains comparable performance.
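For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as follows. This is a minimal illustration of the standard formulation, not the authors' training code: the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    rewards: scalar rewards for a group of responses sampled on-policy
             for a single prompt.
    eps:     small constant (assumed here) that avoids division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one image-classification
# prompt (1.0 = correct label, 0.0 = incorrect label).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When all responses receive the same reward, every advantage is zero,
# so that prompt contributes no policy-gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because advantages are computed relative to the model's own samples, responses that are already typical of the current policy receive little or no update, which is one commonly offered intuition for why on-policy methods may drift less from the pre-trained model than SFT.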
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 22