Keywords: Large Audio Language Models, Audio Understanding, On-Policy Distillation, Cross-modal Alignment
Abstract: Large Audio Language Models (LALMs) have garnered significant research interest.
Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit degraded knowledge and reasoning capabilities relative to their text backbones.
We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space.
To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation.
Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model.
Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process.
At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens.
At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO).
Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially narrows the audio–text performance gap using only 80k synthetic training samples. These results validate the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
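The two alignment levels described above can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation: the function names, the exponential-decay importance schedule (standing in for "importance-aware weighting" that prioritizes early tokens), and the reward-normalization form of the GRPO advantage are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def position_decay_weights(seq_len, decay=0.9):
    """One plausible importance schedule (illustrative): exponentially
    favor early tokens, as CORD prioritizes early/critical positions."""
    return decay ** np.arange(seq_len)

def weighted_reverse_kl(student_logits, teacher_logits, weights):
    """Token-level objective: weighted average of KL(student || teacher)
    per position. The audio-conditioned student acts as the KL's first
    argument (reverse KL), so it is penalized for placing mass where the
    text-conditioned internal teacher assigns low probability."""
    p_s = softmax(student_logits)   # audio-conditioned distributions, (T, V)
    p_t = softmax(teacher_logits)   # text-conditioned distributions, (T, V)
    kl_per_token = (p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12))).sum(axis=-1)
    return float((weights * kl_per_token).sum() / weights.sum())

def group_relative_advantages(rewards, eps=1e-8):
    """Sequence-level signal in the GRPO style: normalize each rollout's
    judge-based reward against its group's mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

As a sanity check, the weighted reverse KL is zero when the audio- and text-conditioned logits coincide, and the group-relative advantages of any reward group sum to (approximately) zero.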
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Learning, Large Language Models, Reasoning, Reinforcement Learning, Knowledge Distillation, Speech and Language, Alignment, Optimization and Training Methods
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7405