Abstract: Infant-worn audio recorders provide a valuable means to analyze an infant’s home environment and vocal interactions with family members. Recent advances in self-supervised learning on large unlabeled datasets and supervised training on limited annotated data have improved performance in this domain. However, data scarcity remains a challenge. We introduce Band-Split SSAMBA (BS-SSAMBA), a self-supervised representation learning method that incorporates band-specific projections and a band-agnostic Mamba encoder to model temporal relationships across frequency bands. Designed for data-efficient learning, BS-SSAMBA effectively leverages both unlabeled and labeled in-domain data. Through extensive experiments on family audio recordings, we show that BS-SSAMBA outperforms vanilla SSAMBA and wav2vec2-based models, demonstrating its effectiveness for infant-centered audio tagging.
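To make the architectural idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the band-split design the abstract describes: the input spectrogram is divided into frequency sub-bands, each sub-band passes through its own projection (the "band-specific projections"), and a single shared encoder (the "band-agnostic" component) models temporal structure within every band. The band count, embedding size, and the stand-in Transformer layer (the paper uses a Mamba encoder) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class BandSplitEncoder(nn.Module):
    """Sketch of band-specific projections + a shared band-agnostic encoder."""

    def __init__(self, n_mels=128, n_bands=4, d_model=256):
        super().__init__()
        assert n_mels % n_bands == 0, "assume equal-width bands for simplicity"
        self.band_width = n_mels // n_bands
        # Band-specific projections: one linear layer per frequency sub-band.
        self.band_proj = nn.ModuleList(
            nn.Linear(self.band_width, d_model) for _ in range(n_bands)
        )
        # Band-agnostic temporal encoder, shared across all sub-bands.
        # A Transformer layer stands in here; BS-SSAMBA uses a Mamba encoder.
        self.shared_encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )

    def forward(self, spec):
        # spec: (batch, time, n_mels) log-mel spectrogram
        bands = spec.split(self.band_width, dim=-1)  # n_bands tensors of (B, T, band_width)
        outs = []
        for proj, band in zip(self.band_proj, bands):
            x = proj(band)               # band-specific projection
            x = self.shared_encoder(x)   # shared temporal modeling within the band
            outs.append(x.mean(dim=1))   # mean-pool over time
        return torch.stack(outs, dim=1)  # (B, n_bands, d_model) band embeddings

# Usage: embed a batch of two 10-second clips (1000 frames, 128 mel bins).
spec = torch.randn(2, 1000, 128)
emb = BandSplitEncoder()(spec)
print(emb.shape)  # torch.Size([2, 4, 256])
```

Sharing one encoder across bands, rather than training one per band, is what makes the design data-efficient: every sub-band contributes training signal to the same temporal model.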