TL;DR: We train a contrastive music model for stem retrieval; its phase-aware architecture achieves state-of-the-art results.
Abstract: Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70$% over the state-of-the-art while using $<50$% of the parameters and training 7$\times$ faster. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-invariant and phase-equivariant biases. PHALAR establishes a new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, and correlates significantly better with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
Lay Summary: When music is mixed, every instrument must lock together in time. A drummer even slightly off-beat with the bassist makes the whole track feel wrong. Current AI models are blind to this: designed to recognize what is in a recording, they discard the timing information needed to judge whether those parts actually fit together. We introduce PHALAR, a model that fixes this by exploiting a mathematical property of sound. Shifting audio in time rotates each component of its frequency-domain representation by an angle proportional to the shift. Rather than throwing this rotation away as standard models do, PHALAR is built to preserve it, encoding rhythmic alignment as a geometric angle in a complex-valued space. The result is a model that judges musical coherence up to roughly 70% more accurately than the previous best approach, using fewer than half the parameters and training seven times faster, while aligning significantly more closely with how human listeners perceive whether stems belong together.
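The property the lay summary relies on is the Fourier shift theorem: delaying a signal by $s$ samples multiplies its $k$-th DFT coefficient by $e^{-2\pi i k s / N}$, which changes the phase but not the magnitude. The sketch below is only a numerical illustration of that theorem on a toy signal. It is not PHALAR's architecture, and every name in it (`dft`, `circular_shift`, `N`, `SHIFT`) is a hypothetical choice made for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_shift(x, s):
    """Delay x by s samples, wrapping around at the end."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]


# Toy signal (an assumption for illustration): two sinusoids over N samples.
N = 16
SHIFT = 3
signal = [
    math.sin(2 * math.pi * 2 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]

X = dft(signal)
X_shifted = dft(circular_shift(signal, SHIFT))

# Shift theorem: each coefficient is rotated by an angle proportional to k * SHIFT.
predicted = [X[k] * cmath.exp(-2j * math.pi * k * SHIFT / N) for k in range(N)]

# Largest deviation between the measured and predicted spectra.
max_error = max(abs(a - b) for a, b in zip(X_shifted, predicted))

# Magnitudes are unchanged by the shift, so a magnitude-only model cannot see it.
magnitude_gap = max(abs(abs(a) - abs(b)) for a, b in zip(X, X_shifted))
```

Both `max_error` and `magnitude_gap` should sit at floating-point noise level. The second quantity shows why a representation that keeps only magnitudes loses timing information, while one that keeps the complex phase retains it.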
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/gladia-research-group/phalar/tree/main
Primary Area: Deep Learning->Other Representation Learning
Keywords: Contrastive Learning, Complex-Valued Neural Networks, Music Information Retrieval
Originally Submitted PDF: pdf
Submission Number: 28719