Keywords: Vision-Language Models, CLIP, Representation Learning, Corruption Detection
Abstract: Large pre-trained models like CLIP exhibit an object-centric bias, rendering them brittle on tasks that require assessing robustness to common image corruptions. We hypothesize that this brittleness stems from low-information classification objectives that fail to instill robust, structural representations. To overcome this, we propose Corruption-Guided Finetuning (CGF), which regularizes the model with a dense auxiliary task: predicting pixel-wise corruption maps. We further introduce a principled three-stage curriculum learning strategy to effectively integrate this dense objective with the global classification task. Our model, CG-CLIP, improves out-of-distribution corruption detection accuracy on the challenging Caltech-256 benchmark from 88.0\% to 97.45\%, a $\sim9$ point gain over a strong baseline, FLYP. This improvement incurs no additional inference overhead, as the auxiliary components are discarded after training. Our work shows that compelling models to solve richer, structurally aware tasks is a direct path to more robust and generalizable AI.
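The abstract describes a global classification objective combined with a dense, pixel-wise corruption-map prediction loss, blended via a three-stage curriculum. The sketch below illustrates one plausible form of such a combined objective in plain NumPy; the specific loss functions, stage boundaries, and weighting schedule (`curriculum_weight`, `combined_loss`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def curriculum_weight(epoch, stage_bounds=(5, 10)):
    """Hypothetical three-stage schedule for the auxiliary loss weight
    (stage boundaries and weights are assumptions for illustration)."""
    if epoch < stage_bounds[0]:
        return 1.0   # stage 1: emphasize dense corruption-map prediction
    elif epoch < stage_bounds[1]:
        return 0.5   # stage 2: joint training of both objectives
    return 0.1       # stage 3: classification-dominated fine-tuning

def combined_loss(cls_logits, labels, pred_map, target_map, epoch):
    """Global classification loss plus a dense pixel-wise auxiliary loss."""
    # Cross-entropy on image-level logits (numerically stable log-softmax).
    shifted = cls_logits - cls_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Per-pixel MSE between predicted and ground-truth corruption maps.
    mse = ((pred_map - target_map) ** 2).mean()
    return ce + curriculum_weight(epoch) * mse
```

Because the dense prediction head exists only to shape the encoder's representations during training, it can be dropped at inference time, consistent with the abstract's claim of zero added inference overhead.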
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9257