Beyond the Final Layer: Attentive Multi-Layer Fusion for Vision Transformers

ICLR 2026 Conference Submission 19300 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Probing, Vision Transformers, Intermediate layer fusion, Transfer Learning, Representation Learning, Parameter efficiency
TL;DR: Intermediate layer fusion with attention outperforms standard probing by capturing complementary features across the network hierarchy.
Abstract: With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than concentrated in the final layers alone. To exploit this distributed information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. The mechanism learns to identify the layers most relevant to a target task and combines low-level structural cues with high-level semantic abstractions. Across 19 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks that differ from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate-layer representations and demonstrate a principled, task-aware approach for unlocking their potential in probing-based adaptation.
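To make the described mechanism concrete, the following is a minimal sketch of an attentive multi-layer fusion probe: a learnable query attends over frozen per-layer features (e.g., the [CLS] token from each transformer block) and a linear head classifies the fused representation. The class name, head structure, and hyperparameters are illustrative assumptions, not the submission's exact architecture.

```python
# Hedged sketch (assumed design): attentive fusion of frozen per-layer ViT features.
import torch
import torch.nn as nn

class AttentiveLayerFusionProbe(nn.Module):
    """Fuses per-layer features of a frozen backbone with a learned attention
    query, then classifies the fused representation with a linear head."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable probe token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, num_layers, dim), e.g. the [CLS] token taken
        # from every block of a frozen Vision Transformer.
        b = layer_feats.shape[0]
        q = self.query.expand(b, -1, -1)                    # (batch, 1, dim)
        fused, attn_weights = self.attn(q, layer_feats, layer_feats)
        # attn_weights has shape (batch, 1, num_layers); inspecting it shows
        # which layers the probe relies on (cf. the attention heatmaps above).
        fused = self.norm(fused.squeeze(1))                 # (batch, dim)
        return self.head(fused)                             # class logits

# Example usage with a hypothetical 12-layer, 768-dim backbone and 100 classes.
probe = AttentiveLayerFusionProbe(dim=768, num_classes=100)
feats = torch.randn(4, 12, 768)   # frozen per-layer features for a batch of 4
logits = probe(feats)             # shape (4, 100)
```

Because the backbone stays frozen, only the query, attention head, norm, and classifier are trained, keeping the parameter cost close to that of a standard linear probe.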
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 19300