VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

ICLR 2026 Conference Submission 14425 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: vision language models, video understanding, egocentric vision
Abstract: Large Vision Language Models (VLMs) excel at general visual reasoning tasks, but their performance degrades sharply when they are deployed in novel domains whose data distribution differs substantially from that seen during pretraining. Existing approaches to adapting VLMs to such target domains rely on finetuning standard VLM components; depending on which components are finetuned, they either limit the VLM's ability to learn domain-specific features or cause catastrophic forgetting of pre-existing capabilities. To address this, we introduce **Vis**ion **Co**ntextualized **P**robing (**VisCoP**), which augments the VLM's vision encoder with a compact set of learnable *visual probes*, enabling domain-specific features to be learned with only minimal updates to the pretrained VLM components. We evaluate VisCoP across three challenging domain adaptation scenarios: cross-view (exocentric → egocentric), cross-modal (RGB → depth), and cross-task (human understanding → robot control). Our experiments show that VisCoP consistently outperforms existing domain adaptation strategies, achieving superior performance on the target domain while better retaining source-domain capabilities. We will release all code, models, and evaluation protocols to facilitate future research in VLM domain adaptation.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14425
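
The abstract describes VisCoP only at a high level: a compact set of learnable visual probes is attached to the vision encoder while the pretrained VLM components receive minimal updates. The sketch below illustrates that general idea under assumptions of ours, not the paper's actual design: a frozen ViT-style backbone (stood in for by a toy `nn.TransformerEncoder`), probe tokens appended to the patch-token sequence, and only the probes left trainable. The class and argument names (`ProbedVisionEncoder`, `num_probes`) are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of learnable "visual probe"
# tokens attached to a frozen vision encoder, so that only the probes receive
# gradients during domain adaptation.
import torch
import torch.nn as nn


class ProbedVisionEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_probes: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # pretrained encoder stays frozen
        # Compact set of learnable visual probes, shared across all inputs.
        self.probes = nn.Parameter(0.02 * torch.randn(1, num_probes, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the VLM's patch embedder
        batch = patch_tokens.size(0)
        tokens = torch.cat([patch_tokens, self.probes.expand(batch, -1, -1)], dim=1)
        out = self.encoder(tokens)  # frozen attention lets probes attend to patches
        # Return the probe outputs as compact, domain-adapted visual features.
        return out[:, -self.probes.size(1):, :]


if __name__ == "__main__":
    dim = 256
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
        num_layers=2,
    )
    model = ProbedVisionEncoder(backbone, embed_dim=dim, num_probes=8)
    feats = model(torch.randn(2, 196, dim))  # 2 frames of 14x14 patch tokens
    print(feats.shape)  # torch.Size([2, 8, 256])
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(trainable)  # only the probe parameters: 8 * 256 = 2048
```

In this reading, only the `num_probes × embed_dim` probe parameters are updated, which is one plausible interpretation of "minimal updates to the pretrained VLM components"; the actual VisCoP mechanism may differ.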