Keywords: Text-to-audio diffusion, Cross-attention, Model interpretability, Time–frequency (mel) representations, Instrument band alignment, Attention visualization, Audio generation, Music information retrieval
TL;DR: AIBA: a training-free probe for text-to-audio diffusion. It logs cross-attention at inference, maps it to time–mel grids, and scores alignment with instrument bands—showing interpretable, high-precision (moderate-recall) patterns on AudioLDM2.
Abstract: We present AIBA (Attention-In-Band Alignment), a lightweight, training-free pipeline to quantify where text-to-audio diffusion models attend on the time–frequency (T–F) plane. AIBA (i) hooks cross-attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed-size mel grids that are directly comparable to audio energy; and (iii) scores agreement with instrument-band ground truth via interpretable metrics (T–F IoU/AP, frequency-profile correlation, and a pointing game). On Slakh2100 with an AudioLDM2 backbone, AIBA reveals consistent instrument-dependent trends (e.g., bass favoring low bands) and achieves high precision with moderate recall.
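To make the three abstract steps concrete, below is a minimal, hypothetical sketch of the pipeline, not the authors' released AIBA code. It assumes a diffusers-style UNet whose cross-attention modules are named `attn2` and, for simplicity, that the hooked module's output is the attention-probability tensor (in a stock AudioLDM2 UNet one would instead patch the attention processor to expose probabilities). The latent layout, grid sizes, and all function names here are illustrative assumptions.

```python
# Hypothetical sketch of AIBA's three steps: (i) hook cross-attention at
# inference, (ii) project attention to a fixed time-mel grid, (iii) score
# T-F IoU against an instrument-band mask. Not the released implementation.
import torch
import torch.nn.functional as F

attn_maps = []  # one (heads, query_tokens, text_tokens) tensor per hooked call

def hook(module, inputs, output):
    # Assumption: the hooked module outputs attention probabilities directly.
    # Stock diffusers attention returns hidden states, so in practice one
    # patches the attention processor to record the probability tensor.
    attn_maps.append(output.detach().cpu())

def register_hooks(unet):
    handles = []
    for name, module in unet.named_modules():
        if name.endswith("attn2"):  # cross-attention blocks (naming assumption)
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done

def to_mel_grid(attn, token_idx, n_mels=64, t_frames=64, latent_hw=(16, 16)):
    """Average over heads, select one text token, unflatten the query axis
    back to an assumed (freq, time) latent layout, and resize to a fixed
    time-mel grid comparable to audio energy."""
    h, w = latent_hw  # assumption: query tokens flatten an h x w latent
    a = attn.mean(0)[:, token_idx].reshape(1, 1, h, w)
    a = F.interpolate(a, size=(n_mels, t_frames), mode="bilinear",
                      align_corners=False)
    return a[0, 0]  # (n_mels, t_frames)

def tf_iou(attn_grid, band_mask, q=0.9):
    """Binarize attention at its q-quantile and compute IoU against a
    {0,1} instrument-band ground-truth mask of the same shape."""
    pred = (attn_grid >= attn_grid.quantile(q)).float()
    inter = (pred * band_mask).sum()
    union = ((pred + band_mask) > 0).float().sum()
    return (inter / union.clamp(min=1)).item()
```

The same per-token grids would feed the paper's other metrics: averaging over time yields a frequency profile for correlation with band energy, and the argmax cell gives the pointing-game hit test.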
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 9