Multi-scale transformer language modeling for music classification tasks

Published: 08 Sept 2025, Last Modified: 10 Sept 2025, LLM4Music @ ISMIR 2025 Poster, CC BY 4.0
Keywords: Music Classification, Discrete Audio Tokens, Self-supervised Learning, Masked Language Modeling, Multi-scale Transformer, Music Information Retrieval (MIR), Rotary Position Embeddings (RoPE)
TL;DR: We introduce Mega-AudioFormer, a multi-scale Transformer that efficiently models long audio sequences as discrete tokens for music classification tasks.
Abstract: Most large-scale audio classification models process two-dimensional spectral data with convolutional neural network (CNN) or Vision Transformer (ViT) architectures, inheriting a vision inductive bias that is misaligned with the temporal nature of audio. While neural audio codecs offer a promising alternative by providing discrete, time-native representations, they produce sequences thousands of tokens long, making standard Transformer architectures computationally expensive to apply. In this study, we present Mega-AudioFormer, a multi-scale Transformer-based model pre-trained from scratch on AudioSet with masked codec-token modeling and fine-tuned on music classification tasks. Our architecture features a global encoder over channel-packed sequences for efficient long-range context, augmented by a local encoder for fine-grained detail. This design confers a key advantage: decode-free inference directly in the compressed domain. Promising performance on music genre recognition (GTZAN), instrument classification (NSynth), and speech/music discrimination validates our approach. This work establishes a scalable and effective new direction for audio foundation models, one explicitly designed to leverage advances from pre-trained language models.
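To make the channel-packing idea concrete, the following is a minimal sketch of how a global/local multi-scale Transformer over discrete codec tokens could be organized. It is an illustration under assumptions, not the authors' implementation: the class name, hyperparameters, and the summed-embedding packing scheme are illustrative, and positional encoding (the paper's keywords mention RoPE) and the masked codec-token pre-training objective are omitted for brevity.

```python
# Illustrative sketch only (not the authors' code): codebook channels are packed
# per frame into one global-encoder token; a small local encoder refines
# per-frame detail; classification happens without decoding audio.
import torch
import torch.nn as nn


class MultiScaleAudioClassifier(nn.Module):
    def __init__(self, vocab_size=1024, n_codebooks=8, d_model=256,
                 n_global_layers=6, n_local_layers=2, n_classes=10):
        super().__init__()
        # One embedding table per codec codebook (channel).
        self.embeds = nn.ModuleList(
            [nn.Embedding(vocab_size, d_model) for _ in range(n_codebooks)]
        )
        # Global encoder: one token per audio frame (channel-packed by summing
        # the codebook embeddings), giving long-range context at frame rate.
        global_layer = nn.TransformerEncoderLayer(
            d_model, nhead=8, dim_feedforward=4 * d_model, batch_first=True
        )
        self.global_encoder = nn.TransformerEncoder(global_layer, n_global_layers)
        # Local encoder: attends over the codebook tokens inside each frame,
        # alongside that frame's global summary, for fine-grained detail.
        local_layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=2 * d_model, batch_first=True
        )
        self.local_encoder = nn.TransformerEncoder(local_layer, n_local_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, codes):
        # codes: (batch, n_codebooks, n_frames) integer codec tokens.
        B, C, T = codes.shape
        # Per-codebook embeddings: (B, C, T, d_model).
        emb = torch.stack(
            [self.embeds[c](codes[:, c]) for c in range(C)], dim=1
        )
        # Channel-pack: sum codebook embeddings into one token per frame.
        packed = emb.sum(dim=1)                      # (B, T, d_model)
        global_ctx = self.global_encoder(packed)     # (B, T, d_model)
        # Local pass over each frame: its C codebook tokens plus the global summary.
        local_in = emb.permute(0, 2, 1, 3).reshape(B * T, C, -1)
        local_in = torch.cat(
            [global_ctx.reshape(B * T, 1, -1), local_in], dim=1
        )
        local_out = self.local_encoder(local_in)     # (B*T, C+1, d_model)
        frame_repr = local_out[:, 0].reshape(B, T, -1)
        # Decode-free classification: pool frames, no waveform reconstruction.
        return self.classifier(frame_repr.mean(dim=1))


if __name__ == "__main__":
    model = MultiScaleAudioClassifier()
    fake_codes = torch.randint(0, 1024, (2, 8, 500))  # 2 clips, 8 codebooks, 500 frames
    print(model(fake_codes).shape)                    # torch.Size([2, 10])
```

The key efficiency point the sketch tries to show is that the global encoder's sequence length equals the number of codec frames rather than frames times codebooks, while the local encoder only ever attends within a single frame, so its cost grows linearly with sequence length.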
Submission Number: 8