video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM
Abstract: While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have largely been confined to solving mathematical problems and processing visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Moreover, pDPO achieves 6-8% improvements over the supervised fine-tuned model on RivaBench. The enhanced reasoning further endows video-SALMONN-o1 with zero-shot synthetic video detection capabilities.
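For readers unfamiliar with step-level preference optimization, the minimal sketch below shows the standard DPO objective applied to a pair of reasoning steps (a preferred vs. a rejected step), the building block that pDPO extends with contrastive step selection. The function name, the `beta` value, and the per-step log-probability inputs are illustrative assumptions, not the paper's exact implementation; see the repository for the real training code.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_logp_win: torch.Tensor,
                  policy_logp_lose: torch.Tensor,
                  ref_logp_win: torch.Tensor,
                  ref_logp_lose: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """DPO preference loss over pairs of reasoning steps (hypothetical sketch).

    Each tensor holds the summed token log-probability of one candidate
    step under the policy or the frozen reference model, with one entry
    per preference pair in the batch.
    """
    # Log-ratio of policy to reference for the preferred and rejected steps.
    win_ratio = policy_logp_win - ref_logp_win
    lose_ratio = policy_logp_lose - ref_logp_lose
    # Bradley-Terry style objective: increase the preferred step's margin.
    return -F.logsigmoid(beta * (win_ratio - lose_ratio)).mean()

# Toy usage with random step log-probabilities for a batch of 4 pairs.
batch = lambda: torch.randn(4)
print(step_dpo_loss(batch(), batch(), batch(), batch()).item())
```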
Lay Summary: We develop video-SALMONN-o1, the first open-source audio-visual large language model that can perform reasoning to understand videos better. While past efforts to improve model reasoning mainly focused on solving math problems or analyzing images, video-SALMONN-o1 is one of the first to target more general video content, like scenes from comedy shows, lectures, or detecting fake (synthetic) videos. To train the system, we created a specialized dataset of complex audio-visual questions with step-by-step answers. We also introduced a new training method called pDPO, which helps the model learn to reason more effectively by rewarding it for choosing better steps in its thought process. We also built RivaBench, a new benchmark to test how well AI systems can reason about videos. On this benchmark, video-SALMONN-o1 performed significantly better than existing models, improving accuracy by 3-8%. It even showed the ability to detect fake videos without being specifically trained for that task. In short, video-SALMONN-o1 takes a big step forward in helping AI understand and reason about complex video content more like humans do.
Link To Code: https://github.com/BriansIDP/video-SALMONN-o1
Primary Area: Deep Learning->Large Language Models
Keywords: audio-visual, chain-of-thought reasoning, large language models, process reward model, direct preference optimization
Submission Number: 7225