Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models
Keywords: Vision-Language Models (VLMs), Task-Interfering Layers, Test-Time Adaptation, Training-Free Adaptation
TL;DR: We find that bypassing certain VLM layers improves task performance. We propose TaLo, a training-free method that boosts accuracy by skipping the most task-interfering layer at inference, with no parameter updates.
Abstract: Current Vision-Language Models (VLMs) have demonstrated remarkable capabilities across a wide range of multimodal tasks. Typically, all layers of a pretrained VLM are engaged by default when making predictions on downstream tasks. Surprisingly, we find that intervening on a single layer, such as by zeroing its parameters, can improve performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. To understand when and why this occurs, we systematically investigate how individual layers influence different tasks via layer intervention (e.g., parameter zeroing). Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when specific layers are bypassed. These improvements generalize across models and datasets, indicating the presence of **Task-Interfering Layers** that harm downstream task performance. To analyze this phenomenon further, we introduce the **Task-Layer Interaction Vector**, which quantifies the effect of intervening on each layer of a VLM for a given task. Crucially, these task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity of their task-layer interaction vectors. Inspired by these findings, we propose TaLo (**Ta**sk-Adaptive **L**ayer Kn**o**ckout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without any parameter updates, TaLo consistently improves performance across various models and datasets, even boosting Qwen-VL’s accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism for unlocking hidden capabilities at inference time. The source code will be made publicly available.
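As a rough illustration of the mechanism described in the abstract (not the authors' released implementation), the sketch below bypasses one transformer block at a time, records the resulting metric change to form a per-layer interaction vector, and then knocks out the layer whose removal helps most. It assumes the VLM exposes its blocks as an `nn.ModuleList` and that an `evaluate(model)` callable returns a scalar task metric on a small task-specific set; `IdentityBlock`, `task_layer_interaction_vector`, and `knock_out_most_interfering_layer` are hypothetical names introduced here for illustration.

```python
# Minimal sketch of task-adaptive layer knockout, under the assumptions stated above.
import torch
import torch.nn as nn


class IdentityBlock(nn.Module):
    """Stand-in block that passes hidden states through unchanged,
    i.e. the knocked-out layer is bypassed at inference."""

    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states


@torch.no_grad()
def task_layer_interaction_vector(model, layers, evaluate):
    """Per-layer metric deltas relative to the unmodified model.

    `layers` is the nn.ModuleList of transformer blocks;
    `evaluate(model)` scores the model on the target task.
    """
    base_score = evaluate(model)
    deltas = []
    for i in range(len(layers)):
        original = layers[i]
        layers[i] = IdentityBlock()                 # bypass layer i
        deltas.append(evaluate(model) - base_score)  # positive => layer interferes
        layers[i] = original                         # restore the layer
    return torch.tensor(deltas)


@torch.no_grad()
def knock_out_most_interfering_layer(model, layers, evaluate):
    """Permanently bypass the layer whose removal helps the task most,
    but only if bypassing it actually improves the metric."""
    deltas = task_layer_interaction_vector(model, layers, evaluate)
    best = int(deltas.argmax())
    if deltas[best] > 0:
        layers[best] = IdentityBlock()
    return model, best, deltas
```

Note that in real Hugging Face-style VLMs, decoder blocks typically return tuples (hidden states plus caches or attention weights), so the identity stand-in would need to mimic that output signature; the sketch only conveys the overall knockout-and-score loop.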
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 825