Keywords: Version Identification, Music Information Retrieval, Audio Representation Learning, Lyrics, Multimodal Alignment
TL;DR: We propose Lyrics-Informed Embeddings (LIE), audio representations trained to align directly with a lyric-derived semantic space.
Abstract: Version Identification aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and recommendation. While state-of-the-art systems rely on complex audio pipelines, they remain computationally costly, opaque, and difficult to scale. We propose a lightweight and interpretable alternative: Lyrics-Informed Embeddings (LIE), audio representations trained to align directly with a lyric-derived semantic space. Our framework leverages advances in automatic speech recognition (Whisper) and multilingual sentence encoders to construct a robust target embedding space from transcribed lyrics. An audio encoder is then trained to project raw audio into this space, optimizing both instance-level alignment and structural consistency. LIE achieves retrieval accuracy on par with, or exceeding, transcription-based and state-of-the-art audio systems, while cutting inference latency by more than 3x relative to transcription pipelines.
Our musically grounded framework is lightweight, reproducible, and yields interpretable embeddings that extend beyond version identification to broader music retrieval tasks.
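The abstract gives no implementation details; the following is a minimal sketch of the training objective it describes (instance-level alignment plus structural consistency against fixed lyric-derived targets). The placeholder MLP audio encoder, the feature dimensions, the choice of `paraphrase-multilingual-MiniLM-L12-v2` as the lyric sentence encoder, and the MSE-based structural term are all assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical sketch of the LIE training objective (assumptions noted above).
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Lyric-derived target space: a multilingual sentence encoder over transcribed lyrics.
text_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


class AudioEncoder(torch.nn.Module):
    """Placeholder audio encoder: projects audio features into the lyric space."""

    def __init__(self, in_dim: int, out_dim: int = 384):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def lie_loss(audio_emb: torch.Tensor, lyric_emb: torch.Tensor,
             structural_weight: float = 0.5) -> torch.Tensor:
    """Instance-level alignment plus a batch-level structural-consistency term."""
    lyric_emb = F.normalize(lyric_emb, dim=-1)
    # Instance-level alignment: pull each audio embedding toward its lyric target.
    align = (1.0 - (audio_emb * lyric_emb).sum(dim=-1)).mean()
    # Structural consistency: match the pairwise similarity structure of the two spaces.
    sim_audio = audio_emb @ audio_emb.T
    sim_lyric = lyric_emb @ lyric_emb.T
    structure = F.mse_loss(sim_audio, sim_lyric)
    return align + structural_weight * structure


# Toy usage: in the described pipeline, lyrics would come from Whisper transcriptions.
lyrics = ["first transcribed lyric ...", "second transcribed lyric ..."]
lyric_emb = text_encoder.encode(lyrics, convert_to_tensor=True).float()
audio_feats = torch.randn(2, 128)  # stand-in for real audio features
encoder = AudioEncoder(in_dim=128, out_dim=lyric_emb.shape[-1])
loss = lie_loss(encoder(audio_feats), lyric_emb)
loss.backward()
```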
Submission Number: 23