Think Out Loud, Pause in Silence: Confidence-Guided Reflect–Pause–Abort for Robust Audio Perceptual Understanding
Keywords: Large Audio Language Models, Latent Reasoning, Reinforcement Learning, Audio Understanding
TL;DR: We propose ConfAudio, a confidence-guided framework that monitors decoding confidence to insert pauses or abort unstable reasoning trajectories, together with PAQA, a perceptually grounded audio QA dataset, improving accuracy and consistency in noisy, multi-speaker audio question answering.
Abstract: Large Audio Language Models (LALMs) fail mainly in two ways: perceptual errors, which misidentify background sounds or speaker turns, and reasoning errors, in which rationales drift and decouple from acoustic evidence. To address these issues, we propose an adaptive framework that couples perceptual grounding with computation that expands only when needed. First, we introduce **PAQA**, a Perceptually grounded Audio QA dataset of 7,470 multiple-choice items that pairs multi-speaker, background-rich audio with stepwise reasoning and reflection annotations, enabling supervision of verifiable, audio-grounded rationales. On the modeling side, we propose **ConfAudio**, which unifies explicit, reflective reasoning (fine-tuned on PAQA) with implicit, pause-driven latent computation trained via GRPO. A confidence-aware controller monitors the lowest-group-confidence (LGC) during decoding, inserting pauses when uncertainty rises and aborting unstable trajectories, thereby reallocating compute toward hard perceptual segments. To stabilize training, we design **a composite reward function** that balances answer correctness, reasoning–answer consistency, perceptual robustness, and output format. Across PAQA, MMAU-mini, and MMAR, ConfAudio consistently improves both accuracy and consistency, particularly in noisy, multi-speaker conditions. Our results demonstrate that confidence-guided, adaptive reasoning grounded in verifiable acoustic evidence mitigates the dominant perceptual and reasoning failure modes in audio question answering.
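To make the controller concrete, below is a minimal sketch of a confidence-guided decoding loop. It assumes LGC is the minimum mean per-token confidence over fixed-size groups of generated tokens; the abstract does not specify this definition, nor the thresholds, group size, or `<pause>` token handling, so all of those are illustrative assumptions rather than the paper's exact method.

```python
# Minimal sketch of a confidence-aware reflect-pause-abort controller.
# ASSUMPTIONS: LGC = min over fixed-size groups of the mean top-1 token
# probability; thresholds, group size, and the "<pause>" special token
# are hypothetical and not taken from the paper.
import torch
import torch.nn.functional as F

PAUSE_THRESHOLD = 0.55   # assumed: insert a pause when LGC dips below this
ABORT_THRESHOLD = 0.30   # assumed: abort the trajectory below this
GROUP_SIZE = 8           # assumed: tokens per confidence group

def lowest_group_confidence(confidences: list[float],
                            group_size: int = GROUP_SIZE) -> float:
    """LGC = min over token groups of the mean confidence (assumed definition)."""
    if not confidences:
        return 1.0
    if len(confidences) < group_size:
        return min(confidences)
    groups = [confidences[i:i + group_size]
              for i in range(0, len(confidences) - group_size + 1, group_size)]
    return min(sum(g) / len(g) for g in groups)

def decode_with_controller(model, tokenizer, input_ids, max_new_tokens=256):
    """Greedy decoding that pauses on rising uncertainty and aborts when unstable."""
    pause_id = tokenizer.convert_tokens_to_ids("<pause>")  # hypothetical special token
    confidences: list[float] = []
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]   # HF-style causal LM assumed
        probs = F.softmax(logits, dim=-1)
        conf, next_id = probs.max(dim=-1)            # confidence = top-1 probability
        confidences.append(conf.item())
        lgc = lowest_group_confidence(confidences)
        if lgc < ABORT_THRESHOLD:
            return input_ids, "aborted"              # unstable trajectory: stop early
        if lgc < PAUSE_THRESHOLD:
            next_id = torch.tensor([pause_id])       # spend extra latent compute here
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return input_ids, "completed"
```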
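The composite reward can similarly be sketched as a weighted sum of the four terms named in the abstract. The weights and the consistency/robustness scoring helpers below are assumptions for illustration; the paper's actual reward shaping may differ.

```python
# Minimal sketch of the composite reward: answer correctness, reasoning-answer
# consistency, perceptual robustness, and output format.
# ASSUMPTIONS: the weights and the [0, 1] scoring of consistency/robustness
# are hypothetical placeholders, not the paper's specification.
def composite_reward(pred_answer: str, gold_answer: str,
                     consistency_score: float,  # assumed in [0, 1]: rationale supports answer
                     robustness_score: float,   # assumed in [0, 1]: stability under audio perturbation
                     format_ok: bool,
                     w_acc: float = 1.0, w_cons: float = 0.5,
                     w_robust: float = 0.3, w_fmt: float = 0.2) -> float:
    """Weighted sum of the four reward terms (weights are hypothetical)."""
    r_acc = 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0
    r_fmt = 1.0 if format_ok else 0.0
    return (w_acc * r_acc + w_cons * consistency_score
            + w_robust * robustness_score + w_fmt * r_fmt)
```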
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3409