Keywords: vision language models, efficiency, layer skipping, visual attention
TLDR: An inference-only layer-skipping strategy for Vision Language Models based on attention to image tokens.
Abstract: Vision Language Models (VLMs) are rapidly advancing in multimodal understanding, largely due to increases in training data and model size. However, the computational demands of VLMs are also increasing, posing challenges for their deployment in resource-constrained environments such as smart televisions (TVs). VLMs are particularly useful in smart TVs for applications such as AI summarization, content question-and-answer (QnA), and user interface (UI) understanding and navigation. Previous research has leveraged layer redundancy to accelerate VLM inference, but these methods often depend on extensive training, which requires significant computational resources and time. More recently, inference-only layer-skipping methods have been proposed for Large Language Models (LLMs). In this paper, we demonstrate that the metrics used for layer skipping in LLMs do not always yield favorable results when applied to VLMs. We introduce an inference-only layer-skipping strategy for VLMs based on attention to image tokens. Our approach increases overall throughput by up to 21% while maintaining near-baseline performance.
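To make the idea concrete, below is a minimal PyTorch sketch of how per-layer attention to image-token positions could drive a skip decision at inference time. It is an illustration under stated assumptions, not the paper's implementation: the toy layer structure and the `image_token_mask`, `skip_threshold`, and `min_keep_layers` names and values are hypothetical.

```python
# Illustrative sketch only: skip decoder layers at inference time when the
# previous layer's attention mass on image tokens falls below a threshold.
# The layer definition and all hyperparameters here are assumptions.

import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """A simplified transformer block: self-attention + MLP with residuals."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor):
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_weights  # attn_weights: (batch, tgt_len, src_len)


def forward_with_visual_attention_skipping(
    layers: nn.ModuleList,
    x: torch.Tensor,
    image_token_mask: torch.Tensor,  # bool, (seq_len,), True at image-token positions
    skip_threshold: float = 0.05,    # assumed hyperparameter, would be tuned per model
    min_keep_layers: int = 2,        # always execute the first few layers
):
    """Run layers sequentially; skip a layer when the most recent attention
    mass on image tokens is below the threshold (illustrative criterion)."""
    visual_attention = None
    for i, layer in enumerate(layers):
        if (
            i >= min_keep_layers
            and visual_attention is not None
            and visual_attention < skip_threshold
        ):
            continue  # skip this layer entirely (no attention/MLP compute)
        x, attn = layer(x)
        # Fraction of attention mass that query tokens place on image tokens.
        visual_attention = attn[..., image_token_mask].sum(dim=-1).mean().item()
    return x


if __name__ == "__main__":
    dim, num_heads, num_layers = 64, 4, 8
    layers = nn.ModuleList(ToyDecoderLayer(dim, num_heads) for _ in range(num_layers))
    seq_len, num_image_tokens = 32, 16
    x = torch.randn(1, seq_len, dim)
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[:num_image_tokens] = True  # assume the first positions hold image tokens
    out = forward_with_visual_attention_skipping(layers, x, mask)
    print(out.shape)  # torch.Size([1, 32, 64])
```

In this sketch the skip decision is made online from the running attention statistic, so skipped layers incur no compute at all; the actual criterion, threshold, and which layers are eligible for skipping are design choices the paper itself specifies.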
Submission Number: 15