Keywords: Music Information Retrieval (MIR), Soundtrack Retrieval, Film Production, Audio-Visual Alignment, Multimodal Learning, Representation Learning, Natural Language Processing (NLP), Deep Learning
TL;DR: We propose a system for soundtrack retrieval in film production that leverages multimodal learning to align music with cinematic context (i.e., images and text).
Abstract: Music is an integral part of an enjoyable cinematic experience, elevating both the emotional depth and the narrative. The growth of platforms that allow film producers to license soundtracks from extensive collections has enabled low-budget filmmakers to achieve high-quality productions. However, with the vast amount of content available, effective soundtrack retrieval becomes paramount to help producers find suitable tracks without a significant time investment. Current soundtrack retrieval systems on these platforms rely heavily on tag selection, which can be time-consuming due to the large number of tracks associated with each tag. In this work, we introduce a multimodal transformer architecture with a cross-attention mechanism, trained on images, plot summaries, and relevant tags through contrastive learning. Our objective evaluations demonstrate that the model effectively utilizes all three inputs to retrieve soundtracks that fit a film.
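To make the approach concrete, below is a minimal sketch (not the authors' code) of the idea the abstract describes: fusing text (plot summary and tags) and image features with cross-attention, then aligning the fused query with music embeddings via a contrastive, InfoNCE-style loss. All module names, dimensions, and the pooling choice are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of cross-attention fusion + contrastive alignment.
# Dimensions, pooling, and loss details are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuses text (plot + tag) tokens and image tokens with cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens, image_tokens):
        # Text tokens attend over image tokens (query = text, key/value = image).
        fused, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        # Mean-pool the fused sequence into a single normalized query embedding.
        return F.normalize(self.proj(fused.mean(dim=1)), dim=-1)


def info_nce(query, music, temperature: float = 0.07):
    """Symmetric contrastive loss; matched query/music pairs lie on the diagonal."""
    logits = query @ music.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(query.size(0))           # index of each positive pair
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    B, T_txt, T_img, D = 8, 32, 49, 256
    fusion = CrossModalFusion(D)
    text = torch.randn(B, T_txt, D)                  # plot + tag token embeddings
    image = torch.randn(B, T_img, D)                 # image patch embeddings
    music = F.normalize(torch.randn(B, D), dim=-1)   # soundtrack embeddings
    loss = info_nce(fusion(text, image), music)
    print(f"contrastive loss: {loss.item():.4f}")
```

At retrieval time, the same fused query embedding would be compared against precomputed soundtrack embeddings by cosine similarity, with the top-ranked tracks returned to the producer.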
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 2