BABBLE: Bridging Structured Labels and Natural Language for Infant-Centric Home Audio Captioning

ACL ARR 2026 January Submission7769 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Audio–Language Models, Infant-Centric Audio Understanding, Fine-Grained Audio Analysis, Child-Centered Speech Processing, Speaker Diarization
Abstract: Children’s early development is shaped by contingent vocal exchanges with caregivers, yet current audio large language models (LLMs) often fail on infant-centric home recordings because they are trained primarily on adult-directed, lexical speech and coarse web-scale audio. As a result, they struggle with non-lexical vocalizations (e.g., infant babbling and crying) and with the fine-grained temporal structure needed to interpret naturalistic caregiver–infant interactions. We introduce BABBLE, a compact audio–language modeling framework that bridges structured developmental annotations and natural-language supervision by converting time-stamped labels into captions. BABBLE combines Whisper-derived semantic features with wav2vec 2.0 acoustic representations in a dual-encoder architecture and supports frame-level labeling, event-level prediction, diarization-oriented outputs, and captioning through a unified formulation. Experiments on infant-centric home audio from 63 families (infants aged 3–14 months) with family-disjoint splits show that BABBLE outperforms recent audio LLMs and strong audio-only baselines on speaker and vocalization prediction, while also improving captioning metrics and reducing diarization error. These results indicate that structured-to-caption supervision is an effective strategy for extending audio–language models to underrepresented, privacy-sensitive, and non-lexical real-world audio domains. Our code is available at https://anonymous.4open.science/r/BABBLE/.
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: spoken language understanding; speech technologies
Contribution Types: Data analysis
Languages Studied: English
Submission Number: 7769
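
The abstract describes converting time-stamped labels into captions. As a rough illustration of that idea only, the following Python sketch renders a list of annotated segments as one caption sentence. The `Segment` fields, the label values ("infant", "caregiver", "babbling", "speech"), and the sentence template are all hypothetical assumptions made for this example; they are not taken from the submission or its code release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One time-stamped annotation (hypothetical schema, not the paper's)."""
    start: float       # segment onset in seconds
    end: float         # segment offset in seconds
    speaker: str       # assumed label, e.g. "infant" or "caregiver"
    vocalization: str  # assumed label, e.g. "babbling", "crying", "speech"


def segments_to_caption(segments: List[Segment]) -> str:
    """Render structured segments as a single natural-language caption."""
    if not segments:
        return "No vocalization is annotated in this clip."
    # Order events by onset so the caption follows the audio timeline.
    ordered = sorted(segments, key=lambda s: s.start)
    parts = [
        f"from {s.start:.1f}s to {s.end:.1f}s, the {s.speaker} produces {s.vocalization}"
        for s in ordered
    ]
    text = "; ".join(parts) + "."
    # Capitalize the first character to form a sentence.
    return text[0].upper() + text[1:]


example = [
    Segment(2.0, 3.5, "caregiver", "speech"),
    Segment(0.0, 1.2, "infant", "babbling"),
]
caption = segments_to_caption(example)
print(caption)
```

A template like this keeps the timing and labels recoverable from the caption text, which is one plausible way for caption supervision to carry the same information as the structured annotations.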