JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transition

Jiashuo Yu; Yao Yao; Boyu Chen; Alex Wang

10 Sept 2025 (modified: 13 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: video-to-music generation; video soundtracking; music generation; diffusion model; generative models; transition

TL;DR: A diffusion-based video-to-music framework that generates arbitrary-length, high-fidelity music waveforms with smooth transitions.

Abstract: We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present \texttt{JenBridge}, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and natural transitions. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text–audio corpora to establish robust musical priors, then adapting to the video domain with dual text–visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, \texttt{JenBridge} incorporates a novel adaptive transition mechanism. This mechanism features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) agent that acts as a director, selecting the most appropriate transition for each narrative shift. To rigorously assess this task, we propose the LVS Benchmark, which includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that \texttt{JenBridge} significantly outperforms existing methods on both objective and subjective metrics, particularly in transition naturalness and overall narrative coherence. \texttt{JenBridge} represents a significant step towards fully automated, professional-quality video soundtracking. The code and benchmark will be made publicly available.

Supplementary Material: zip

Primary Area: generative models

Submission Number: 3583
