Semitone-Aware Fourier Encoding: A Music-Structured Approach to Audio-Text Alignment

Published: 23 Sept 2025 · Last Modified: 08 Nov 2025 · AI4Music · CC BY 4.0
Keywords: Cross-modal alignment, Music information retrieval, Semantic consistency
TL;DR: We bridge audio-text semantics by mapping music into theory-guided 12-tone features enhanced with Fourier encoding, achieving 9.5% higher alignment accuracy with interpretable structure.
Abstract: Conventional audio-text alignment methods predominantly rely on raw spectral features, which insufficiently capture the mathematical and perceptual structures inherent to music. We introduce a representation paradigm grounded in music theory: mapping frequency spectra into the 12-tone equal temperament system—an organization consistent with the logarithmic nature of human pitch perception and widely adopted across musical cultures—followed by Fourier-based feature encoding to capture nonlinear and multi-scale acoustic patterns. This framework enhances interpretability, preserves musically salient tonal structures, improves robustness to noise, and strengthens semantic alignment with textual descriptors. Preliminary experiments indicate that such music-theory-guided representations provide a principled foundation for bridging the audio-text modality gap. We suggest this direction as a promising step toward integrating cognitive insights and domain knowledge into cross-modal representation learning.
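The abstract's two-stage pipeline — folding a spectrum into 12 equal-temperament pitch classes, then applying a Fourier feature encoding — can be sketched as follows. This is a hypothetical reconstruction, not the authors' released code: the pitch-class folding follows the standard chroma construction (semitone index `12·log2(f/440) + 69`, taken modulo 12), and the sinusoidal encoding over the 12-dimensional pitch-class vector is an assumed form of "Fourier-based feature encoding"; the paper does not specify its exact parameters.

```python
import numpy as np

def spectrum_to_pitch_classes(magnitudes, freqs, f_ref=440.0):
    """Fold a magnitude spectrum into 12 equal-temperament pitch classes.

    Assumed mapping: MIDI-style semitone index 12*log2(f/f_ref) + 69
    (A4 = 440 Hz -> pitch class 9), rounded and taken modulo 12.
    """
    chroma = np.zeros(12)
    valid = freqs > 0  # log2 is undefined at DC
    semitones = 12.0 * np.log2(freqs[valid] / f_ref) + 69.0
    pitch_classes = np.round(semitones).astype(int) % 12
    # Accumulate spectral energy into each of the 12 pitch-class bins
    np.add.at(chroma, pitch_classes, magnitudes[valid])
    return chroma / (chroma.sum() + 1e-9)  # normalize to a distribution

def fourier_encode(x, n_harmonics=4):
    """Project a 12-dim pitch-class vector onto sin/cos bases over the
    pitch-class circle at several harmonics (assumed encoding form)."""
    k = np.arange(1, n_harmonics + 1)[:, None]            # (H, 1)
    phase = 2 * np.pi * k * np.arange(12)[None, :] / 12.0  # (H, 12)
    return np.concatenate([np.sin(phase) @ x, np.cos(phase) @ x])

# Example: a narrow spectral peak at 440 Hz should land on pitch class 9 (A)
freqs = np.linspace(0, 8000, 4096)
mags = np.exp(-0.5 * ((freqs - 440.0) / 5.0) ** 2)
chroma = spectrum_to_pitch_classes(mags, freqs)
features = fourier_encode(chroma)  # 2 * n_harmonics = 8 values
```

Encoding on the pitch-class circle rather than the raw bin index makes the representation rotation-equivariant under transposition, which is one plausible reading of the "interpretable structure" claim.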
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
(Optional) Short Video Recording Link: https://youtu.be/-lwArUK2Pi0
Submission Number: 7