PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione; Michele Mancusi; Giorgio Strano; Luca Cerovaz; Donato Crisostomi; Roberto Ribuoli; Emanuele Rodolà

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0

TL;DR: We train a contrastive music model for stem retrieval; its phase-aware architecture achieves state-of-the-art results.

Abstract: Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70$% over the state of the art while requiring $<50$% of the parameters and delivering a $7\times$ training speedup. By combining a Learned Spectral Pooling layer with a complex-valued head, PHALAR enforces pitch-invariant and phase-equivariant biases. PHALAR establishes a new state of the art in retrieval across MoisesDB, Slakh, and ChocoChorales, and correlates significantly more strongly with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.

Lay Summary: When music is mixed, every instrument must lock together in time. A drummer even slightly off-beat with the bassist makes the whole track feel wrong. Current AI models are blind to this: designed to recognize what is in a recording, they discard the timing information needed to judge whether those parts actually fit together. We introduce PHALAR, a model that fixes this by exploiting a mathematical property of sound. Shifting audio in time rotates each component of its frequency-domain representation by an angle proportional to the shift, and rather than throwing this rotation away as standard models do, PHALAR is built to preserve it, encoding rhythmic alignment as a geometric angle in a complex-valued space. The result is a model that judges musical coherence up to 70% more accurately than the previous best approach, using fewer than half the parameters and training seven times faster, while aligning significantly more closely with how human listeners perceive whether stems belong together.
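
The property the lay summary relies on is the Fourier shift theorem: delaying a signal leaves the magnitude of every frequency bin unchanged and rotates its phase by an angle proportional to both the delay and the bin index. The following is a minimal NumPy sketch of that property only, not of PHALAR itself. The function name `shift_phase_rotation`, the random test signal, the signal length, and the use of a circular shift are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def shift_phase_rotation(n=1024, shift=37, seed=0):
    """Compare the DFT of a delayed signal with the shift-theorem prediction.

    Hypothetical helper for illustration; the signal is random noise and
    the delay is circular (np.roll), which makes the identity exact.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)

    # Circularly delay the signal by `shift` samples.
    x_shifted = np.roll(x, shift)

    X = np.fft.fft(x)
    X_shifted = np.fft.fft(x_shifted)

    # Shift theorem: bin k is rotated by the angle -2*pi*k*shift/n.
    k = np.arange(n)
    predicted = X * np.exp(-2j * np.pi * k * shift / n)
    return X, X_shifted, predicted

X, X_shifted, predicted = shift_phase_rotation()

# Largest deviation between the measured and predicted spectra.
max_err = np.max(np.abs(X_shifted - predicted))
# Largest change in magnitude caused by the delay.
mag_err = np.max(np.abs(np.abs(X_shifted) - np.abs(X)))
print(f"prediction error: {max_err:.2e}, magnitude change: {mag_err:.2e}")
```

Both reported quantities should sit at floating-point noise level. A representation that keeps only magnitudes therefore cannot distinguish the delayed signal from the original, whereas a complex-valued one retains the delay as a rotation.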

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/gladia-research-group/phalar/tree/main

Primary Area: Deep Learning->Other Representation Learning

Keywords: Contrastive Learning, Complex-Valued Neural Networks, Music Information Retrieval

Originally Submitted PDF: pdf

Submission Number: 28719
