Keywords: Multimodal music tokenization, Generative music retrieval, Music recommendation systems
TL;DR: We introduce a multimodal music tokenizer that converts music metadata into compact, hierarchical tokens using residual quantization. These tokens enable scalable music retrieval, LLM integration, and downstream generative applications.
Abstract: Recent advances in generative retrieval allow large language models (LLMs) to recommend items by generating their identifiers token by token, rather than by running nearest-neighbor search over embeddings. This approach requires each item, such as a music track, to be represented by a compact, semantically meaningful token sequence that LLMs can generate. We propose a multimodal music tokenizer (3MToken) that transforms rich metadata from a music database, including audio, credits, semantic tags, song and artist descriptions, musical characteristics, release dates, and consumption patterns, into discrete tokens using a Residual-Quantized Variational Autoencoder (RQ-VAE). Our method learns hierarchical representations, capturing coarse features at early quantization levels and refining them at later levels, preserving fine-grained information. We train and evaluate our model on a large-scale dataset of 1.6 million tracks, where it achieves +40.0\%, +43.4\%, and +15.8\% improvements in Precision@k, Recall@k, and Hit@k, respectively, over the baselines.
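For intuition, below is a minimal Python sketch of residual quantization, the mechanism the abstract describes (coarse tokens at early levels, residuals refined at later levels). The codebooks, dimensions, and toy embedding are illustrative assumptions only, not the paper's actual 3MToken encoder, codebook sizes, or training objective.

# Minimal sketch of residual quantization (the core step of an RQ-VAE),
# using random codebooks and a toy track embedding. Hyperparameters below
# are assumptions for illustration, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

num_levels, codebook_size, dim = 4, 256, 64           # assumed hyperparameters
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def residual_quantize(embedding, codebooks):
    """Return one discrete token per quantization level.

    Early levels capture coarse structure; each later level quantizes the
    residual left by the previous ones, refining the representation.
    """
    residual = embedding.copy()
    tokens = []
    for level_codes in codebooks:
        # pick the nearest codebook entry to the current residual
        idx = int(np.argmin(np.linalg.norm(level_codes - residual, axis=1)))
        tokens.append(idx)
        residual = residual - level_codes[idx]         # pass the remainder down
    return tokens

track_embedding = rng.normal(size=dim)                 # stand-in for fused multimodal features
print(residual_quantize(track_embedding, codebooks))   # e.g. a 4-token identifier

The resulting token sequence is what a generative retrieval model would emit, one token per level, to identify a track.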
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 17