Abstract: Vision-language models (VLMs) excel at tasks like visual question answering and image captioning, but their reliance on frozen, pretrained image encoders such as CLIP often leads to persistent vision errors that degrade downstream performance. Moreover, real-world deployment demands that VLMs continually adapt to new, scarce data in a few-shot setting without forgetting prior knowledge. To meet these challenges, we introduce LoRSU (Low-Rank Adaptation with Structured Updates), a lightweight and robust technique for few-shot continual learning of VLMs’ image encoders. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. In experiments on VQA benchmarks under a few-shot continual learning protocol, LoRSU delivers superior scalability, efficiency, and accuracy, offering a practical solution for dynamic, resource-constrained vision-language applications.
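To make the structured-update idea concrete, the following is a minimal PyTorch sketch of a LoRSU-style adapter applied to an image encoder. It is not the paper's actual procedure: the gradient-norm selection criterion, the rank, and the helper names (`LoRALinear`, `gradient_scores`, `apply_structured_lora`) are illustrative assumptions, since the abstract only states that the most critical parameters are identified and updated.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank correction."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x):
        # Zero-initialised A makes the adapter a no-op before training.
        return self.base(x) + x @ (self.A @ self.B).t()


def gradient_scores(model, loss_fn, few_shot_batch):
    """Score each linear layer by its weight-gradient norm on the few-shot data
    (an assumed importance criterion for illustration only)."""
    model.zero_grad()
    loss_fn(model, few_shot_batch).backward()
    scores = {name: m.weight.grad.norm().item()
              for name, m in model.named_modules()
              if isinstance(m, nn.Linear) and m.weight.grad is not None}
    model.zero_grad()
    return scores


def apply_structured_lora(model, scores, k=2, rank=4):
    """Attach LoRA adapters only to the k highest-scoring linear layers."""
    for name in sorted(scores, key=scores.get, reverse=True)[:k]:
        parent_name, _, child = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child, LoRALinear(getattr(parent, child), rank))
    # Only the low-rank factors A and B remain trainable.
    for n, p in model.named_parameters():
        p.requires_grad_(n.split(".")[-1] in ("A", "B"))
    return model
```

In a few-shot continual-learning loop, one would recompute the scores on each new session and fine-tune only the selected adapters, which is what keeps the update cost far below a full fine-tune of the VLM.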
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Soma_Biswas1
Submission Number: 5539