Abstract: This paper proposes a method to control the music generation process for videos using users’ facial expressions, aligning the generated music with their emotions. Unlike previous works, this method takes into account users’ facial expressions, which reflect their emotions, to create more resonant music. We establish functional relationships between facial expressions and music attributes and integrate these functions into the generation (inference) phase of the music generation process, making zero-shot controllable music generation from facial expressions feasible. To address the challenge of objectively verifying the emotional correspondence of the generated music, we introduce a novel evaluation metric that compares video, music, and emotions within a common latent space. Experimental results demonstrate the effectiveness of our method.
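To make the evaluation idea concrete, the sketch below illustrates one way a common-latent-space metric could be computed: video, music, and emotion labels are embedded into a shared space and compared by cosine similarity. This is a minimal, hypothetical sketch, not the paper's actual metric; the encoders are random-projection placeholders, and the emotion vocabulary, latent dimensionality, and function names (`embed_video`, `embed_music`, `emotion_alignment`) are assumptions for illustration only.

```python
# Hypothetical sketch of a common-latent-space evaluation: embed video, music,
# and emotion labels into one space and score their agreement by cosine similarity.
# The encoders here are stand-in random projections, not the paper's models.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # assumed dimensionality of the shared latent space


def embed_video(video_frames: np.ndarray) -> np.ndarray:
    """Placeholder video encoder: mean-pool frames, then project to the shared space."""
    pooled = video_frames.mean(axis=0)
    return pooled @ rng.standard_normal((pooled.shape[-1], DIM))


def embed_music(audio_features: np.ndarray) -> np.ndarray:
    """Placeholder music encoder: mean-pool features, then project to the shared space."""
    pooled = audio_features.mean(axis=0)
    return pooled @ rng.standard_normal((pooled.shape[-1], DIM))


# Assumed emotion vocabulary; in practice the anchors would come from an emotion/text encoder.
EMOTION_ANCHORS = {name: rng.standard_normal(DIM) for name in ["happy", "sad", "angry", "calm"]}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors in the shared latent space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def emotion_alignment(video_frames: np.ndarray, audio_features: np.ndarray) -> float:
    """Score how closely the music's latent matches the emotion inferred from the video."""
    v, m = embed_video(video_frames), embed_music(audio_features)
    video_emotion = max(EMOTION_ANCHORS, key=lambda e: cosine(v, EMOTION_ANCHORS[e]))
    return cosine(m, EMOTION_ANCHORS[video_emotion])


# Toy usage with random stand-in inputs (30 frames x 2048-d features, 100 audio steps x 128-d features).
score = emotion_alignment(rng.standard_normal((30, 2048)), rng.standard_normal((100, 128)))
print(f"emotion alignment score: {score:.3f}")
```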