Keywords: reasoning, multi-modal learning, audio-visual
TL;DR: We propose AVRT, a framework that distills reasoning from single-modality teachers to enable efficient supervised fine-tuning, achieving state-of-the-art performance in audio-visual reasoning with minimal RL.
Abstract: While recent advances in reasoning models have shown remarkable progress in text-based domains, developing effective reasoning capabilities in multimodal settings, particularly audio-visual ones, remains challenging, mainly because of the limited availability of high-quality reasoning data for the target modality combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning data by distilling knowledge from specialized single-modality teachers. To this end, we generate high-quality reasoning traces with a vision-reasoning teacher and an audio-reasoning teacher and merge the resulting traces with an LLM merger model. This enables a two-stage training procedure: supervised fine-tuning of student models as a cold start, followed by reinforcement learning. Our evaluation shows that the resulting models achieve competitive performance on various benchmarks, including OmniBench, DailyOmni, and MMAR, establishing a new pipeline for effectively training audio-visual reasoning models.
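The sketch below illustrates the data-generation step described in the abstract: two single-modality teachers produce reasoning traces that an LLM merger fuses into one audio-visual trace, which then serves as supervised fine-tuning data for the cold start before reinforcement learning. All names and signatures (`AVSample`, `build_sft_example`, the teacher and merger callables) are illustrative assumptions, not interfaces specified by the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AVSample:
    question: str
    video: object   # placeholder for video frames or visual features
    audio: object   # placeholder for an audio clip or audio features
    answer: str

def build_sft_example(
    sample: AVSample,
    vision_teacher: Callable[[object, str], str],   # hypothetical: returns a vision reasoning trace
    audio_teacher: Callable[[object, str], str],    # hypothetical: returns an audio reasoning trace
    merger: Callable[[str, str, str], str],         # hypothetical: LLM that fuses the two traces
) -> Dict[str, str]:
    """Distill one audio-visual reasoning example from single-modality teachers."""
    vision_trace = vision_teacher(sample.video, sample.question)
    audio_trace = audio_teacher(sample.audio, sample.question)
    # The merger model combines the two single-modality traces into one
    # coherent audio-visual reasoning trace for the student.
    merged_trace = merger(vision_trace, audio_trace, sample.question)
    return {"question": sample.question, "reasoning": merged_trace, "answer": sample.answer}

def build_sft_dataset(samples: List[AVSample], vision_teacher, audio_teacher, merger):
    """Stage 1 data: merged traces for the SFT cold start; stage 2 (RL) would
    then refine the SFT-initialized student on the target task."""
    return [build_sft_example(s, vision_teacher, audio_teacher, merger) for s in samples]
```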
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3713