Keywords: Music-to-Image Generation; Multi-modal Learning; Human-in-the-loop Machine Learning; Reinforcement Learning from Human Preferences
Abstract: Mapping temporally evolving musical affect into coherent visual imagery is a challenging instance of cross-modal generation: audio is abstract, layered, and subjective, whereas images are static and concrete. We present MusePainter, a general framework that integrates structured cross-modal alignment with multi-axis preference learning to achieve fine-grained controllability in generative models. MusePainter first extracts structured descriptors that capture the structural, stylistic, and affective dimensions of music; these descriptors then serve as controllable guidance for image synthesis. To handle subjectivity, we introduce a preference optimization scheme that disentangles emotional consistency, semantic alignment, and creative appeal, and optimizes each axis independently. Experiments on curated benchmarks and user studies show that MusePainter surpasses strong audio-to-image and audio-to-text-to-image baselines in semantic fidelity, stylistic congruence, and affective resonance. Although developed for music-to-image generation, the framework's components, such as interpretable descriptors and multi-axis preference optimization, may extend to other modalities, offering insights for broader controllable cross-modal generation.
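To make the multi-axis preference scheme concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): one pairwise Bradley-Terry-style reward head per axis (emotional consistency, semantic alignment, creative appeal), each trained independently on axis-specific human comparisons. All names, dimensions, and the embedding interface are assumptions for illustration.

```python
# Hypothetical sketch: independent per-axis preference reward models.
import torch
import torch.nn as nn

AXES = ["emotion", "semantics", "creativity"]  # assumed axis names

class AxisRewardModel(nn.Module):
    """Scores a (music descriptor embedding, image embedding) pair on one axis."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, music_emb, image_emb):
        return self.head(torch.cat([music_emb, image_emb], dim=-1)).squeeze(-1)

def pairwise_preference_loss(r_chosen, r_rejected):
    # Bradley-Terry style objective: the human-preferred image should score higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One reward model and optimizer per axis, optimized independently.
models = {a: AxisRewardModel() for a in AXES}
optims = {a: torch.optim.Adam(models[a].parameters(), lr=1e-4) for a in AXES}

def train_step(batch):
    """batch[axis] = (music_emb, chosen_image_emb, rejected_image_emb)."""
    losses = {}
    for axis, (m, img_pos, img_neg) in batch.items():
        optims[axis].zero_grad()
        loss = pairwise_preference_loss(models[axis](m, img_pos), models[axis](m, img_neg))
        loss.backward()
        optims[axis].step()
        losses[axis] = loss.item()
    return losses

# Toy usage with random embeddings in place of real music/image features.
dummy = {a: (torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)) for a in AXES}
print(train_step(dummy))
```

Keeping a separate reward head and optimizer per axis is one straightforward way to realize the "optimized independently" property described in the abstract; how the per-axis signals are later combined to steer the generator is left unspecified here.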
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 6460