Keywords: image fusion, video fusion, low-level vision
TL;DR: We present UniVF, a unified video fusion framework with multi-frame learning; VF-Bench, the first benchmark covering four video fusion tasks; and a unified spatial–temporal evaluation protocol, together with a new temporal consistency loss.
Abstract: The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: [vfbench.github.io](https://vfbench.github.io).
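The abstract describes optical flow-based feature warping for temporal coherence, and the TL;DR mentions a temporal consistency loss. Below is a minimal, hypothetical PyTorch sketch of these two ingredients; the function names, the (dx, dy) flow convention, and the use of `grid_sample` are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: optical flow-based feature warping and a simple
# temporal consistency penalty, assuming PyTorch and flow in pixel units
# with channels ordered (dx, dy). Names are illustrative only.
import torch
import torch.nn.functional as F


def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp neighbor-frame features `feat` (B, C, H, W) to the current frame
    using a dense optical flow field `flow` (B, 2, H, W)."""
    b, _, h, w = feat.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    # Displace the grid by the flow, then normalize to [-1, 1] for grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def temporal_consistency_loss(fused_t: torch.Tensor,
                              fused_prev: torch.Tensor,
                              flow_prev_to_t: torch.Tensor) -> torch.Tensor:
    """One plausible form of a temporal consistency loss: the previous fused
    frame, warped to frame t, should match the current fused frame."""
    warped_prev = warp_features(fused_prev, flow_prev_to_t)
    return (fused_t - warped_prev).abs().mean()
```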
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 627