Myna: Masking-Based Contrastive Learning of Musical Representations

ICLR 2026 Conference Submission7889 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: self-supervised learning, music representations, contrastive learning, masking
TL;DR: Myna is a masking-based contrastive framework for musical representation learning on mel-spectrograms; it sets a new state of the art for self-supervised methods on music key detection and is competitive with MULE/MERT-95M across various MIR tasks.
Abstract: In this paper, we present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone, replacing SampleCNN on raw audio; and (2) a novel data augmentation strategy—token masking—that masks 90% of spectrogram tokens (e.g., 16x16 patches). These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in traditional contrastive methods (e.g., CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations (e.g., pitch shifts), Myna retains pitch sensitivity, enhancing performance on tasks like key detection. (iii) The use of vertical patches (128x2 instead of 16x16) allows the model to better capture features critical for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results. Trained on a single GPU, it outperforms MULE (62M) on average and rivals MERT-95M, which were trained on 16 and 64 GPUs, respectively. When scaled to 85M parameters, Myna achieves further improvements across all tasks and is competitive with models like MERT-330M, MusicFM, and MuQ despite being 3-7x smaller and trained with an order of magnitude fewer GPUs in less time. Additionally, it surpasses MERT-95M-public and MuQ$_{m4a}$, establishing itself as the best-performing model trained on publicly available data. We release our code and models to promote reproducibility and facilitate future research: https://github.com/ghost-signal/myna
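To make the token-masking augmentation concrete, below is a minimal NumPy sketch (not the authors' implementation; the function name, patch handling, and 128-mel toy input are illustrative assumptions). It splits a mel-spectrogram into 16x16 patch tokens and keeps a random 10%, matching the 90% mask ratio described in the abstract; two independent maskings of the same clip would serve as a contrastive positive pair.

```python
import numpy as np

def mask_tokens(spec, patch_h=16, patch_w=16, mask_ratio=0.9, rng=None):
    """Illustrative sketch: patchify a mel-spectrogram and keep a random
    (1 - mask_ratio) fraction of patch tokens.

    spec: (n_mels, n_frames) array; dimensions are cropped to multiples of
    the patch size. Returns the kept tokens and their token indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    H = (spec.shape[0] // patch_h) * patch_h
    W = (spec.shape[1] // patch_w) * patch_w
    spec = spec[:H, :W]
    # Patchify into (num_tokens, patch_h * patch_w).
    tokens = (spec.reshape(H // patch_h, patch_h, W // patch_w, patch_w)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, patch_h * patch_w))
    n_keep = max(1, int(round(len(tokens) * (1 - mask_ratio))))
    keep = rng.permutation(len(tokens))[:n_keep]
    return tokens[keep], keep

# Two independent maskings of the same clip form a contrastive positive pair.
spec = np.random.rand(128, 96)       # toy 128-mel, 96-frame spectrogram
view_a, idx_a = mask_tokens(spec)
view_b, idx_b = mask_tokens(spec)
```

The 128x2 "vertical" patches used by the hybrid model correspond to calling the same routine with `patch_h=128, patch_w=2`, so each token spans the full frequency axis.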
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7889