Instruction-guided fusion of multi-layer visual features in Large Vision-Language Models

Published: 2026, Last Modified: 07 Oct 2025. Pattern Recognit. 2026. License: CC BY-SA 4.0
Abstract: Highlights
• Analyzed task dependencies of hierarchical visual features in LVLMs.
• Proposed a module that fuses multi-layer visual features based on task instructions.
• Integrating the module into LLaVA-v1.5 significantly improves performance over baseline and peers.
• Reveals that higher-level features excel in semantics and lower-level features aid fine-grained tasks.
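The highlights describe fusing multi-layer visual features according to the task instruction but give no architectural details. As a rough illustration only, one way such a fusion could work is to score each encoder layer against an instruction embedding, turn the scores into softmax weights, and take the weighted sum of the layer features. Everything in this sketch is an assumption made for illustration and is not the authors' module: the names `fuse_layers` and `layer_keys`, the dot-product scoring, and the use of a single feature vector per layer are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, instruction_embedding, layer_keys):
    """Weighted sum of per-layer features, with weights driven by the instruction.

    layer_features:        list of L vectors (each D floats), one per encoder layer
    instruction_embedding: vector of K floats summarizing the task instruction
    layer_keys:            list of L vectors (each K floats); hypothetical learned
                           parameters, one key per layer (an assumption of this sketch)
    """
    # Score each layer by the dot product of the instruction with that layer's key.
    scores = [
        sum(q * k for q, k in zip(instruction_embedding, key))
        for key in layer_keys
    ]
    weights = softmax(scores)

    # Fuse: per-dimension weighted sum across layers.
    dim = len(layer_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]
    return weights, fused
```

With toy inputs, an instruction that aligns more strongly with the first layer's key shifts weight toward that layer's features, while identical keys give a plain average. A real implementation would operate on batched token-level tensors and learn the keys jointly with the model.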