Abstract: Musical mashups increasingly serve as a testbed for computational creativity, yet existing automated systems often reduce harmonic compatibility to surface-level matches in key or chord labels. This paper proposes a two-phase framework for chord latent decoupling that treats harmony as a transferable, disentangled latent dimension, enabling controllable mashup generation beyond traditional signal-level heuristics. Our method extracts chord latents using timbre-invariant training and adversarial objectives, factorizing music into (i) a harmony latent capturing the chord progression and (ii) a complementary non-harmony latent encoding the remaining information, such as melody, rhythm, and timbre. A joint decoder reconstructs coherent audio from these disentangled components, permitting cross-song recomposition in which one track's harmonic structure guides another's musical content. Experiments on large-scale music datasets, combined with perceptual listening studies, demonstrate that proximity in the learned harmony space correlates with perceived mashup compatibility and that harmonic transfer produces coherent, musically interpretable results. We discuss limitations in supervision granularity, residual leakage, and decoder fidelity, and outline several salient directions: scalable pseudo-label distillation, cross-modal harmony embeddings for retrieval and recommendation, and structure-aware alignment for long-horizon generation. Our results highlight chord-latent decoupling as a data-driven path to interpretable control in AI music generation.
External IDs: dblp:conf/bigdataconf/ChauH25