Band-Split Self-supervised Mamba for Infant-centered Audio Analysis

Xulin Fan, Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

Published: 2025, Last Modified: 07 Jan 2026, INTERSPEECH 2025, CC BY-SA 4.0

Abstract: Infant-worn audio recorders provide a valuable means to analyze an infant’s home environment and vocal interactions with family members. Recent advances in self-supervised learning on large unlabeled datasets and supervised training on limited annotated data have improved performance in this domain. However, data scarcity remains a challenge. We introduce Band-Split SSAMBA (BS-SSAMBA), a self-supervised representation learning method that incorporates band-specific projections and a band-agnostic Mamba encoder to model temporal relationships across frequency bands. Designed for data-efficient learning, BS-SSAMBA effectively leverages both unlabeled and labeled in-domain data. Through extensive experiments on family audio recordings, we show that BS-SSAMBA outperforms vanilla SSAMBA and wav2vec2-based models, demonstrating its effectiveness for infant-centered audio tagging.
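The combination of band-specific projections and a band-agnostic encoder described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the band edges, the embedding size, the random weights, and the function names `band_split_project` and `shared_encoder`. In particular, the shared encoder is a plain causal linear recurrence standing in for the Mamba encoder, not an implementation of it.

```python
import numpy as np

def band_split_project(spec, band_edges, weights):
    """Project each frequency band of a (time, freq) spectrogram with its own matrix."""
    tokens = []
    for (lo, hi), w in zip(band_edges, weights):
        band = spec[:, lo:hi]        # (time, band_width)
        tokens.append(band @ w)      # (time, d_model), band-specific projection
    return np.stack(tokens, axis=0)  # (n_bands, time, d_model)

def shared_encoder(x, decay=0.9):
    """Hypothetical stand-in for the band-agnostic encoder (NOT Mamba):
    the same causal recurrence h[t] = decay * h[t-1] + x[t] is applied to every band."""
    out = np.zeros_like(x)
    state = np.zeros((x.shape[0], x.shape[2]))
    for t in range(x.shape[1]):
        state = decay * state + x[:, t, :]
        out[:, t, :] = state
    return out

rng = np.random.default_rng(0)
n_frames, n_mels, d_model = 50, 128, 16          # assumed sizes
spec = rng.standard_normal((n_frames, n_mels))   # placeholder for a log-mel spectrogram
band_edges = [(0, 32), (32, 64), (64, 128)]      # assumed band boundaries
weights = [rng.standard_normal((hi - lo, d_model)) * 0.1 for lo, hi in band_edges]

tokens = band_split_project(spec, band_edges, weights)
encoded = shared_encoder(tokens)
print(encoded.shape)  # (3, 50, 16)
```

The point of the sketch is the parameter layout: each band gets its own projection matrix, while the temporal model that follows uses a single set of parameters for all bands.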

External IDs: dblp:conf/interspeech/Fan0HM25