Abstract: Highlights
• Analyzed task dependencies of hierarchical visual features in LVLMs.
• Proposed a module that fuses multi-layer visual features based on task instructions.
• Integrating the module into LLaVA-v1.5 significantly improves performance over baseline and peers.
• Reveals that higher-level features excel in semantics and lower-level features aid fine-grained tasks.
External IDs: dblp:journals/pr/LiZCCLLLX26