Abstract: Extending image-based Large Multimodal Models (LMMs) to video-based LMMs typically requires temporal modeling during pre-training. However, training the temporal modules gradually erases the visual knowledge learned from diverse image-text scenarios, degrading performance on some downstream tasks. To address this issue, in this paper we introduce a novel, efficient transfer approach termed MTransLLAMA, which transfers pre-trained image LMMs to fine-grained video tasks using only small-scale training sets. Our method requires fewer trainable parameters and achieves faster adaptation and higher accuracy than pre-training video-based LMMs. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in the temporal attention for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in the spatial-temporal attention. Experiments demonstrate that, across multiple datasets and tasks and without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.
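To make the two architectural ideas in the abstract concrete, below is a minimal, hedged sketch (not the authors' implementation) of a spatial-temporal block that (a) reuses the spatial attention weights for temporal attention and (b) routes between global and local (windowed) temporal attention with a learned per-token gate. All module and parameter names (SpatialTemporalBlock, router, window) are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalBlock(nn.Module):
    """Illustrative sketch: shared spatial/temporal attention + dynamic routing."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        # A single attention module: applied over the spatial axis, then its
        # weights are reused (parameter sharing) over the temporal axis.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        # Router producing mixing weights for global vs. local temporal attention.
        self.router = nn.Linear(dim, 2)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        B, T, P, D = x.shape

        # Spatial attention: attend across patches within each frame.
        xs = self.norm_s(x).reshape(B * T, P, D)
        s_out, _ = self.attn(xs, xs, xs)
        x = x + s_out.reshape(B, T, P, D)

        # Temporal attention: the same attention weights, applied across frames.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * P, T, D)
        g_out, _ = self.attn(xt, xt, xt)                        # global: all frames
        local_mask = self._local_mask(T, xt.device)             # local: nearby frames
        l_out, _ = self.attn(xt, xt, xt, attn_mask=local_mask)

        # Dynamic routing: per-token mixture of global and local outputs.
        gates = F.softmax(self.router(xt), dim=-1)              # (B*P, T, 2)
        t_out = gates[..., :1] * g_out + gates[..., 1:] * l_out
        return x + t_out.reshape(B, P, T, D).permute(0, 2, 1, 3)

    def _local_mask(self, T: int, device) -> torch.Tensor:
        # Boolean mask: True blocks attention beyond the temporal window.
        idx = torch.arange(T, device=device)
        return (idx[None, :] - idx[:, None]).abs() > self.window


if __name__ == "__main__":
    block = SpatialTemporalBlock(dim=256)
    video_tokens = torch.randn(2, 8, 16, 256)   # (batch, frames, patches, dim)
    print(block(video_tokens).shape)            # torch.Size([2, 8, 16, 256])
```

Sharing one attention module across the spatial and temporal axes keeps the number of newly trainable parameters small, which is the kind of lightweight transfer the abstract claims; the exact fusion and routing details in MTransLLAMA may differ from this sketch.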
External IDs: dblp:journals/tmm/CaoZHCHWW25