Keywords: Multimodal music tokenization, Generative music retrieval, Music recommendation systems
TL;DR: We introduce a multimodal music tokenizer that converts music metadata into compact, hierarchical tokens using residual quantization. These tokens enable scalable music retrieval, LLM integration, and downstream generative applications.
Abstract: Recent advances in generative retrieval allow large language models (LLMs) to recommend items by generating their identifiers token by token, rather than by running nearest-neighbor search over embeddings. This approach requires each item, such as a music track, to be represented by a compact, semantically meaningful token sequence that LLMs can generate. We propose a multimodal music tokenizer (3MToken) that transforms rich metadata from a music database, including audio, credits, semantic tags, song and artist descriptions, musical characteristics, release dates, and consumption patterns, into discrete tokens using a Residual-Quantized Variational Autoencoder (RQ-VAE). Our method learns hierarchical representations, capturing coarse features at early quantization levels and refining them at later levels, preserving fine-grained information. We train and evaluate our model on a large-scale dataset of 1.6 million tracks, where it achieves +40.0\%, +43.4\%, and +15.8\% improvements in Precision@k, Recall@k, and Hit@k, respectively, over the baselines.
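For intuition, below is a minimal Python sketch of residual quantization, the mechanism the abstract describes (coarse tokens at early levels, residuals refined at later levels). The codebooks, dimensions, and toy embedding are illustrative assumptions only, not the paper's actual 3MToken encoder, codebook sizes, or training objective.

# Minimal sketch of residual quantization (the core step of an RQ-VAE),
# using random codebooks and a toy track embedding. Hyperparameters below
# are assumptions for illustration, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

num_levels, codebook_size, dim = 4, 256, 64           # assumed hyperparameters
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def residual_quantize(embedding, codebooks):
    """Return one discrete token per quantization level.

    Early levels capture coarse structure; each later level quantizes the
    residual left by the previous ones, refining the representation.
    """
    residual = embedding.copy()
    tokens = []
    for level_codes in codebooks:
        # pick the nearest codebook entry to the current residual
        idx = int(np.argmin(np.linalg.norm(level_codes - residual, axis=1)))
        tokens.append(idx)
        residual = residual - level_codes[idx]         # pass the remainder down
    return tokens

track_embedding = rng.normal(size=dim)                 # stand-in for fused multimodal features
print(residual_quantize(track_embedding, codebooks))   # e.g. a 4-token identifier

The resulting token sequence is what a generative retrieval model would emit, one token per level, to identify a track.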
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 17