Keywords: Vision-Language Models, Action Understanding
TL;DR: We introduce visual probes that interact with intermediate layers of the vision encoder to improve domain adaptation in VLMs.
Abstract: Large Vision-Language Models (VLMs) excel at general visual reasoning tasks, but their performance degrades sharply when deployed in novel domains that exhibit substantial distribution shifts relative to pretraining data. Existing approaches for adapting VLMs to new domains typically rely on finetuning standard VLM components. Depending on which components are updated, these methods either restrict the model’s ability to learn domain-specific features or cause catastrophic forgetting of previously acquired capabilities. We introduce Vision Contextualized Probing (VisCoP), a method that augments a VLM’s vision encoder with a compact set of learnable visual probes. These probes enable the model to acquire domain-specific visual representations while requiring only minimal updates to pretrained VLM components. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric → egocentric), cross-modal (RGB → depth), and cross-task (human understanding → robot control). Experiments show that VisCoP consistently outperforms existing domain adaptation strategies, achieving stronger performance on target domains while better preserving capabilities from the source domain. Code, models, and evaluation protocols are released at https://github.com/dominickrei/VisCoP.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 26