Keywords: reasoning, multi-modal learning, audio-visual
TL;DR: We propose AVRT, a framework that distills reasoning from single-modality teachers to enable efficient supervised fine-tuning, achieving state-of-the-art performance in audio-visual reasoning with minimal RL.
Abstract: While recent advances in reasoning models have shown remarkable progress in text-based domains, developing effective reasoning capabilities in multimodal settings, particularly audio-visual ones, remains challenging, mainly because of the limited availability of high-quality reasoning data for the target modality combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning data by distilling knowledge from specialized single-modality teachers. To this end, we generate high-quality reasoning traces with a vision-reasoning teacher and an audio-reasoning teacher and merge the resulting traces with an LLM merger model. This enables a two-stage training procedure: supervised fine-tuning of student models as a cold start, followed by reinforcement learning. Our evaluation shows that the resulting models achieve competitive performance on various benchmarks, including OmniBench, DailyOmni, and MMAR, establishing a new pipeline for effectively training audio-visual reasoning models.
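The sketch below illustrates the data-generation step described in the abstract: two single-modality teachers produce reasoning traces that an LLM merger fuses into one audio-visual trace, which then serves as supervised fine-tuning data for the cold start before reinforcement learning. All names and signatures (`AVSample`, `build_sft_example`, the teacher and merger callables) are illustrative assumptions, not interfaces specified by the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AVSample:
    question: str
    video: object   # placeholder for video frames or visual features
    audio: object   # placeholder for an audio clip or audio features
    answer: str

def build_sft_example(
    sample: AVSample,
    vision_teacher: Callable[[object, str], str],   # hypothetical: returns a vision reasoning trace
    audio_teacher: Callable[[object, str], str],    # hypothetical: returns an audio reasoning trace
    merger: Callable[[str, str, str], str],         # hypothetical: LLM that fuses the two traces
) -> Dict[str, str]:
    """Distill one audio-visual reasoning example from single-modality teachers."""
    vision_trace = vision_teacher(sample.video, sample.question)
    audio_trace = audio_teacher(sample.audio, sample.question)
    # The merger model combines the two single-modality traces into one
    # coherent audio-visual reasoning trace for the student.
    merged_trace = merger(vision_trace, audio_trace, sample.question)
    return {"question": sample.question, "reasoning": merged_trace, "answer": sample.answer}

def build_sft_dataset(samples: List[AVSample], vision_teacher, audio_teacher, merger):
    """Stage 1 data: merged traces for the SFT cold start; stage 2 (RL) would
    then refine the SFT-initialized student on the target task."""
    return [build_sft_example(s, vision_teacher, audio_teacher, merger) for s in samples]
```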
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3713