Muffin: Muffled Audio Encoding with Filter-Based Masking

Published: 05 Nov 2025, Last Modified: 05 Nov 2025
Venue: NLDL 2026 Abstracts
License: CC BY 4.0
Keywords: self-supervised learning, masked autoencoding, audio representation learning
TL;DR: We propose MUFFIN, a masked autoencoding framework that learns audio representations directly from raw waveforms using filter-based masking to bridge time- and frequency-domain modeling.
Abstract: Masked autoencoders have advanced representation learning in vision and language, yet audio remains dominated by spectrogram-based approaches that treat spectrograms like images and disregard audio-specific characteristics. We propose Muffled Audio Encoding (MUFFIN), a framework for self-supervised learning directly on raw waveforms using 1D transformers and masking through time-domain filters (e.g., low-, high-, and band-pass). This approach encourages representations that capture long-range and frequency-selective dependencies without requiring Fourier transforms or discarding phase information. We outline our design and experimental plan for evaluating this method across multiple audio domains.
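Illustrative sketch (not the authors' implementation): the abstract does not specify how filter-based masking is applied, so the following Python example shows one plausible reading, in which randomly chosen fixed-length patches of a raw waveform are replaced by a low-pass filtered ("muffled") version of themselves. The function names `lowpass_kernel` and `muffle_patches`, the windowed-sinc FIR design, the patch length, the mask ratio, and the cutoff frequency are all assumptions made for illustration. A model would then be trained to reconstruct the original waveform from the muffled input.

```python
import numpy as np

def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc FIR low-pass kernel, normalized to unit DC gain.

    Hypothetical helper: the filter design is an assumption, not from the abstract.
    """
    fc = cutoff_hz / sample_rate                  # cutoff in cycles per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2  # tap indices centered on zero
    h = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    return h / h.sum()

def muffle_patches(wave, sample_rate, patch_len, mask_ratio, cutoff_hz, rng):
    """Replace a random subset of fixed-length patches with low-passed audio.

    Returns the muffled waveform and a boolean mask over patches
    (True = patch was filtered). All parameters are illustrative assumptions.
    """
    kernel = lowpass_kernel(cutoff_hz, sample_rate)
    # mode="same" with an odd-length symmetric kernel keeps the signal aligned.
    filtered = np.convolve(wave, kernel, mode="same")
    num_patches = len(wave) // patch_len
    num_masked = int(round(mask_ratio * num_patches))
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_ids] = True
    out = wave.copy()
    for i in masked_ids:
        start = i * patch_len
        out[start:start + patch_len] = filtered[start:start + patch_len]
    return out, mask

# Usage: one second of a 200 Hz + 6 kHz tone mixture at 16 kHz.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 6000 * t)
muffled, mask = muffle_patches(
    wave, sr, patch_len=1600, mask_ratio=0.5, cutoff_hz=1000, rng=rng
)
```

In this sketch the masked patches keep their low-frequency content and stay in the time domain, so no Fourier transform is applied to the signal itself. High-pass or band-pass variants could be obtained by swapping the kernel, again as an assumption rather than a description of the proposed method.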
Serve As Reviewer: ~Marcel_A._Vélez_Vásquez1
Submission Number: 48