Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability

Tianxiang Hao; Mengyao Lyu; Hui Chen; Sicheng Zhao; Xiaohan Ding; Jungong Han; Guiguang Ding

Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability

Tianxiang Hao, Mengyao Lyu, Hui Chen, Sicheng Zhao, Xiaohan Ding, Jungong Han, Guiguang Ding

27 Sept 2024 (modified: 15 Nov 2024)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Vision-Language Models, Parameter-Efficient Tuning, Information Density, Generalizability

Abstract: With the advancement of large pre-trained vision-language models, effectively transferring the knowledge embedded within these foundational models to downstream tasks has become a pivotal topic, particularly in data-scarce environments. Recently, parameter-efficient fine-tuning approaches, especially prompt tuning, have garnered considerable attention. To better understand the nature of prompt tuning, we propose the concept of ``Information Density'' (ID) to indicate whether a matrix strongly belongs to certain feature spaces rather than being evenly distributed across various feature spaces. We suppose a higher ID with strong bias across some feature spaces naturally leads to excellent robustness and stability. Our research, inspired by the observation that generalizability is closely linked to the information density of the prompt embedding, introduces the Dense Information Prompt (DIP). DIP aims to enhance information density to improve generalization. Several alternative algorithms to increase ID are proposed and verified effective. With further help of proper initialization and regularization, comprehensive experiments substantiate the superiority of DIP. Notably, DIP surpasses the latest state-of-the-art methods by a substantial margin with an exceptionally small parameter count and no extra inference overhead. Across a range of tasks spanning 11 datasets, DIP improves the average downstream accuracy of classic prompt tuning by up to 5.76\%.

Primary Area: applications to computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11785

Loading