Keywords: large vision-language models, multimodal learning, continual learning
Abstract: Vision-language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between the visual and language modalities. However, existing open-source VLMs rely heavily on pretrained vision encoders such as CLIP. Despite CLIP's robustness across diverse domains, it still exhibits significant image understanding errors, which propagate to the VLM responses and result in suboptimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates model parameters, leading to substantial performance improvements on data where previous mistakes occurred while maintaining overall robustness. We demonstrate the effectiveness of our method in both offline and continual few-shot updates, simulating a model-editing regime for VLMs. While our method also scales efficiently and effectively to adapting the large language model (LLM) component of the VLM, we show that separately updating the vision encoder alone is a highly efficient alternative: it improves VLM performance with less than one-tenth of the compute resources required for updating the LLM. Our method is further supported by theoretical justification of the parameter selection strategy.
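The abstract describes the method only at a high level (selective, local updates to vision-encoder parameters on few-shot data where the model previously erred). The snippet below is a minimal PyTorch sketch of one plausible instantiation, not the paper's actual algorithm: it restricts the update to the parameter entries with the largest gradient magnitudes on a small batch of error examples. The names `vision_encoder`, `loss_fn`, and `batch`, and the `top_frac` and `lr` hyperparameters, are illustrative assumptions.

```python
# Hypothetical sketch of a selective, local vision-encoder update.
# The actual selection criterion of the paper is not given in the abstract;
# the global gradient-magnitude mask below is an assumption for illustration.
import torch


def selective_local_update(vision_encoder, batch, loss_fn, top_frac=0.01, lr=1e-4):
    """Update only the top-`top_frac` fraction of vision-encoder parameter
    entries, ranked by gradient magnitude on a few-shot batch of previously
    misanswered examples. All other entries stay frozen, which is intended
    to preserve the encoder's overall robustness."""
    vision_encoder.zero_grad()
    loss = loss_fn(vision_encoder, batch)
    loss.backward()

    # Rank all parameter entries globally by absolute gradient.
    grads = torch.cat([p.grad.abs().flatten()
                       for p in vision_encoder.parameters() if p.grad is not None])
    k = max(1, int(top_frac * grads.numel()))
    threshold = torch.topk(grads, k).values.min()

    # Apply the sparse, local update; unselected entries are untouched.
    with torch.no_grad():
        for p in vision_encoder.parameters():
            if p.grad is None:
                continue
            mask = (p.grad.abs() >= threshold).float()
            p -= lr * p.grad * mask
```

In this kind of scheme, the sparsity of the mask is what keeps the update "local": only a small fraction of entries move, so behavior on data far from the error examples is largely unchanged, consistent with the robustness claim in the abstract.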
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14086