Keywords: Audio reasoning, symbolic representations, large language models, structured prompting, audio question answering
Abstract: While large language models (LLMs) have made substantial progress in text and vision, their ability to reason about sound remains limited. Most recent approaches rely on dense audio embeddings that are hard to interpret and that often fail on tasks requiring fine-grained or structured understanding.
This project introduces SAR-LM, a symbolic audio reasoning pipeline that extracts structured, text-based features from audio across three domains: speech, general sound, and music. For speech, we use Whisper-large and Wav2Vec2-based emotion recognition. For sound events, we rely on PANNs. For music, we combine low-level transcription from MT3, mid-level chord progressions from Chordino, and high-level tags from MusicNN. These symbolic features are used in two ways: directly as flat prompts, or summarized into natural-language captions by Gemini 2.5 Pro. To evaluate performance, we compare both approaches against captions generated end-to-end from raw audio, as well as a mixed setting that combines symbolic and audio inputs.
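To make the two prompting modes concrete, the following is a minimal sketch of how pre-extracted symbolic features might be serialized either as a flat prompt or as a caption-summarization request. The feature fields and the helper names are illustrative assumptions, not the paper's actual data schema or code; the dictionary simply stands in for outputs of the extractors named above (Whisper-large, Wav2Vec2 emotion, PANNs, MT3, Chordino, MusicNN).

```python
# Illustrative sketch only: hypothetical field names standing in for the
# symbolic outputs of the speech, sound-event, and music extractors.
symbolic_features = {
    "speech": {"transcript": "hold the door please", "emotion": "neutral"},
    "sound_events": [("door", 0.92), ("footsteps", 0.71)],
    "music": {"chords": ["C", "G", "Am", "F"], "tags": ["acoustic", "guitar"]},
}

def flat_prompt(features: dict, question: str) -> str:
    """Serialize the symbolic features directly into the LLM prompt."""
    lines = ["Audio evidence (symbolic):"]
    for domain, value in features.items():
        lines.append(f"- {domain}: {value}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def caption_prompt(features: dict) -> str:
    """Ask the LLM to first compress the symbols into a natural-language caption."""
    return (
        "Summarize the following symbolic audio features as one "
        f"descriptive caption of the audio clip:\n{features}"
    )

print(flat_prompt(symbolic_features, "What is the speaker doing?"))
```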
We test all methods on the MMAU benchmark, which pairs audio clips with multiple-choice questions for audio understanding and reasoning across speech, music, and environmental sounds. We find that symbolic prompts can match or outperform dense baselines in several reasoning tasks. These findings suggest that symbolic audio inputs, combined with structured prompting, offer a promising path toward more accurate and explainable audio question answering with LLMs.
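As an illustration of the evaluation setup, the sketch below shows an MMAU-style multiple-choice loop: each item pairs a clip's prompt context with a question and answer options, and accuracy is computed over the predicted letters. This is an assumed harness for exposition, not the benchmark's official evaluation code; `ask_llm` is a hypothetical stand-in for a Gemini 2.5 Pro call.

```python
# Illustrative MMAU-style multiple-choice evaluation (not the official harness).
def format_mcq(context: str, question: str, choices: list[str]) -> str:
    """Build a multiple-choice prompt from symbolic context, question, and options."""
    letters = "ABCD"
    opts = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{context}\n\nQuestion: {question}\n{opts}\nAnswer with a single letter."

def evaluate(items: list[dict], ask_llm) -> float:
    """Return accuracy of the model's letter answers over all items."""
    correct = 0
    for item in items:
        prompt = format_mcq(item["context"], item["question"], item["choices"])
        pred = ask_llm(prompt).strip()[:1].upper()  # keep only the leading letter
        correct += pred == item["answer"]
    return correct / len(items)

# Dummy example with a stub model that always answers "A".
items = [{"context": "sound_events: door, footsteps",
          "question": "What most likely happened?",
          "choices": ["A door opened", "A dog barked", "It rained", "Silence"],
          "answer": "A"}]
print(evaluate(items, lambda p: "A"))
```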
Submission Number: 21