Keywords: Visual Prompting Large Vision Models, Efficient In-context Learning, Vision Transformers
Abstract: Visual prompt-based large vision models achieve remarkable performance across a range of vision tasks. However, their large parameter counts and the cost of processing visual prompts make them computationally intensive and memory-hungry, leading to slow training and inference. To tackle these challenges, we propose the Efficient Painter model, which leverages a novel trident block built on context-aggregated attention to alleviate cross-task gaps while reducing memory and computation overhead. Furthermore, we introduce a cross-block feature union module that captures global contextual information at multiple levels and accelerates training. Together, these components lower training cost and inference-time memory requirements. Our model strikes a balance between speed and memory efficiency, achieving a 19$\times$ reduction in FLOPs; it is also 9$\times$ smaller in model size and runs 4.1$\times$ and 27$\times$ faster during training and inference, respectively. Comprehensive experiments demonstrate that our design effectively processes additional visual prompts and outperforms baseline methods on standard benchmarks such as \textit{SIDD} and \textit{LoL} in zero-shot settings, improving performance by 0.4\% and 1.2\%, respectively.
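To make the abstract's two components concrete, below is a minimal, purely illustrative PyTorch sketch of one plausible reading of a "trident block" with context-aggregated attention: keys and values are pooled into a small set of context tokens before cross-attention (reducing the quadratic cost), and three parallel branches are fused with a residual connection. The abstract does not specify the block's internals, so every name, shape, and design choice here (ContextAggregatedAttention, num_context_tokens, the three-branch layout) is an assumption, not the paper's actual method.

```python
import torch
import torch.nn as nn

class ContextAggregatedAttention(nn.Module):
    """Hypothetical sketch: pool the sequence into a few context tokens,
    then cross-attend queries to them. This is one common way to cut the
    O(N^2) attention cost; the paper's actual mechanism may differ."""
    def __init__(self, dim, num_heads=8, num_context_tokens=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_context_tokens)  # aggregate context
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C)
        ctx = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, K, C) context tokens
        out, _ = self.attn(query=x, key=ctx, value=ctx)     # attend to pooled context
        return out

class TridentBlock(nn.Module):
    """Assumed 'trident' layout: three parallel context-aggregated
    attention branches, concatenated and fused back to the input width."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.branches = nn.ModuleList(
            [ContextAggregatedAttention(dim) for _ in range(3)]
        )
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, x):  # x: (B, N, C)
        y = torch.cat([b(self.norm(x)) for b in self.branches], dim=-1)
        return x + self.fuse(y)  # residual connection

# Usage on ViT-style patch tokens (shapes are illustrative):
block = TridentBlock(dim=256)
tokens = torch.randn(2, 196, 256)
out = block(tokens)  # (2, 196, 256)
```

Attending to a fixed number of pooled context tokens rather than the full sequence is what would plausibly drive the claimed FLOP reduction; the cross-block feature union module (not sketched here) would then combine such outputs across depths.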
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11795