Keywords: vision language models, efficiency, layer skipping, visual attention
TLDR: An inference-only layer-skipping strategy for Vision Language Models based on attention to image tokens.
Abstract: Vision Language Models (VLMs) are rapidly advancing in multimodal understanding, largely due to increases in training data and model size. However, the computational demands of VLMs are also increasing, posing challenges for their deployment in resource-constrained environments such as smart televisions (TVs). VLMs are particularly useful in smart TVs for applications such as AI summarization, content question-and-answer (QnA), and user interface (UI) understanding and navigation. Previous research has leveraged layer redundancy to accelerate VLM inference, but these methods often depend on extensive training, which requires significant computational resources and time. More recently, inference-only layer-skipping methods have been proposed for Large Language Models (LLMs). In this paper, we demonstrate that the metrics used for layer skipping in LLMs do not always yield favorable results when applied to VLMs. We introduce an inference-only layer-skipping strategy for VLMs based on attention to image tokens. Our approach increases overall throughput by up to 21% while maintaining near-baseline performance.
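To make the idea concrete, below is a minimal PyTorch sketch of how per-layer attention to image-token positions could drive a skip decision at inference time. It is an illustration under stated assumptions, not the paper's implementation: the toy layer structure and the `image_token_mask`, `skip_threshold`, and `min_keep_layers` names and values are hypothetical.

```python
# Illustrative sketch only: skip decoder layers at inference time when the
# previous layer's attention mass on image tokens falls below a threshold.
# The layer definition and all hyperparameters here are assumptions.

import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """A simplified transformer block: self-attention + MLP with residuals."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor):
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_weights  # attn_weights: (batch, tgt_len, src_len)


def forward_with_visual_attention_skipping(
    layers: nn.ModuleList,
    x: torch.Tensor,
    image_token_mask: torch.Tensor,  # bool, (seq_len,), True at image-token positions
    skip_threshold: float = 0.05,    # assumed hyperparameter, would be tuned per model
    min_keep_layers: int = 2,        # always execute the first few layers
):
    """Run layers sequentially; skip a layer when the most recent attention
    mass on image tokens is below the threshold (illustrative criterion)."""
    visual_attention = None
    for i, layer in enumerate(layers):
        if (
            i >= min_keep_layers
            and visual_attention is not None
            and visual_attention < skip_threshold
        ):
            continue  # skip this layer entirely (no attention/MLP compute)
        x, attn = layer(x)
        # Fraction of attention mass that query tokens place on image tokens.
        visual_attention = attn[..., image_token_mask].sum(dim=-1).mean().item()
    return x


if __name__ == "__main__":
    dim, num_heads, num_layers = 64, 4, 8
    layers = nn.ModuleList(ToyDecoderLayer(dim, num_heads) for _ in range(num_layers))
    seq_len, num_image_tokens = 32, 16
    x = torch.randn(1, seq_len, dim)
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[:num_image_tokens] = True  # assume the first positions hold image tokens
    out = forward_with_visual_attention_skipping(layers, x, mask)
    print(out.shape)  # torch.Size([1, 32, 64])
```

In this sketch the skip decision is made online from the running attention statistic, so skipped layers incur no compute at all; the actual criterion, threshold, and which layers are eligible for skipping are design choices the paper itself specifies.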
Submission Number: 15