Abstract: Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches for VLMMs involve Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating LLM with visual encoders, and additional learnable parameters. Here, aligning video with text, and vice versa, remains a challenge, primarily due to the insufficient quality and quantity of multimodal instruction-tune data compared to that of text-only. This discrepancy often results in alignments that poorly ground the video content. % that are not well-grounded in the video content.To address this, we present a novel alignment strategy that employs a multimodal AI system equipped with Reinforcement Learning from AI Feedback (RLAIF), providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. Our approach uniquely integrates detailed video descriptions as context into a multimodal AI system during the preference feedback generation to enrich the understanding of video content, a process we call \emph{context-aware reward modeling}. Empirical evaluations on various video benchmarks demonstrate that our \method outperforms existing approaches, including the SFT model. Public release of our code, models, and datasets is forthcoming.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English
0 Replies
Loading