Keywords: large audio language model, multi-audio, audio understanding, audio reasoning
TL;DR: A Large Audio Language Model capable of multi-audio analysis and reasoning, together with training data and a benchmark.
Abstract: While Large Audio Language Models (LALMs) have achieved strong performance on reasoning over single audio clips, understanding and reasoning over multiple audio clips remains a significant challenge. In this paper, we introduce PolyAudio, a novel LALM designed specifically for this task. To systematically train and evaluate the model, we first identify and formalize eleven foundational multi-audio reasoning capabilities. These capabilities, spanning sound, music, and speech, are chosen to represent a broad range of challenging real-world scenarios. To instill these skills, we fine-tune Qwen2-Audio-7B-Instruct using Group Relative Policy Optimization (GRPO), which mitigates common issues associated with Supervised Fine-Tuning (SFT), such as catastrophic forgetting. Specifically, we construct training data with reward signals that explicitly favor correctly synthesizing information across multiple audio clips. PolyAudio achieves 58.6% on the MMAU-Pro multi-audio subset and 71.2% on our PolyAudio-Bench, substantially outperforming baselines on multi-audio reasoning tasks while maintaining performance on single-audio tasks. To promote research in this space, we will publicly release the model, data generators, evaluation scripts, and training recipes at the time of publication.
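For context, GRPO optimizes the policy with group-relative advantages instead of a learned value function; the following is a minimal sketch of the standard formulation (as introduced by Shao et al., 2024, for DeepSeekMath), assuming the common normalization and not the paper's specific multi-audio reward design. For each prompt, $G$ responses are sampled and each is scored with a reward $r_i$; the advantage of the $i$-th response is

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})},$$

and the policy is then updated with a PPO-style clipped objective on $\hat{A}_i$, typically with a KL penalty against a reference model. Under this scheme, a reward that is granted only when an answer correctly combines evidence from several clips directly raises the advantage of multi-audio-consistent responses within each sampled group.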
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23472