Effortless Event-Augmented Latent Diffusion for Video Frame Interpolation

Abstract

Latent Diffusion Models have advanced video frame interpolation by generating intermediate frames between input frames. However, effectively handling large temporal gaps and complex motion remains a challenge, often leading to artifacts. We argue that event camera signals, with their ability to capture continuous motion at high temporal resolutions, are ideal for bridging these temporal gaps and enhancing interpolation precision. Given the impracticality of training an event-assisted model from scratch, we introduce a novel adapter-based framework that seamlessly and effortlessly integrates high-temporal-resolution cues from event cameras into pre-trained image-to-video models without modifying their underlying structure. Our method leverages Image Warped Events (IWEs) and bidirectional sparse optical flow for precise spatial and temporal alignment, significantly reducing artifacts and improving interpolation quality. Experimental results demonstrate that our event-enhanced interpolation achieves superior accuracy and temporal coherence compared to existing state-of-the-art methods.

Method

Illustration of Our Framework. (a) We extract bidirectional sparse optical flow and IWEs from the input event stream using the Contrast Maximization (CMax) method. (b) During fine-tuning, the model is enhanced with three components: an IWE encoder, alignment adapters inserted into a subset of DiT blocks, and LoRA layers applied to all DiT blocks. (c) The flow-based alignment adapter leverages the bidirectional flows to warp intermediate features from neighboring frames, aligning them temporally with the current frame. This facilitates motion-consistent feature propagation throughout the denoising process.
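To make step (a) more concrete, the following is a minimal sketch of the idea behind an Image of Warped Events and the contrast objective that CMax maximizes. It is not the implementation used in this work: it assumes a single global flow vector instead of bidirectional sparse flow, ignores event polarity, and votes into the nearest pixel. The names image_of_warped_events and contrast, and the synthetic moving-edge data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def image_of_warped_events(xs, ys, ts, flow, t_ref, shape):
    """Warp events to time t_ref along a candidate flow and accumulate them.

    Assumption: `flow` is one global (vx, vy) in pixels per second.
    """
    h, w = shape
    vx, vy = flow
    # Transport every event back along the candidate motion to t_ref.
    xw = np.rint(xs - (ts - t_ref) * vx).astype(int)
    yw = np.rint(ys - (ts - t_ref) * vy).astype(int)
    # Discard events that leave the image plane after warping.
    inside = (xw >= 0) & (xw < w) & (yw >= 0) & (yw < h)
    iwe = np.zeros((h, w), dtype=np.float64)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(iwe, (yw[inside], xw[inside]), 1.0)
    return iwe


def contrast(iwe):
    """Variance of the IWE: larger when events align into sharp edges."""
    return float(np.var(iwe))


# Synthetic data: a vertical edge (10 rows tall) moving right at 20 px/s.
t_grid = np.linspace(0.0, 1.0, 21)
rows = np.arange(10)
ts = np.repeat(t_grid, rows.size)
ys = np.tile(rows, t_grid.size).astype(float)
xs = 5.0 + 20.0 * ts

shape = (10, 32)
sharp = image_of_warped_events(xs, ys, ts, (20.0, 0.0), 0.0, shape)
blurred = image_of_warped_events(xs, ys, ts, (0.0, 0.0), 0.0, shape)

print("contrast with the true flow:", contrast(sharp))
print("contrast with zero flow:   ", contrast(blurred))
```

With the correct flow, all events from the moving edge collapse onto one column and the contrast rises; with a wrong flow they stay smeared across the trajectory. A CMax-style estimator searches for the flow that maximizes this contrast, and the sharp image obtained at the optimum is the IWE.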

Baseline comparisons

×24 interpolation

Panels: Input pairs, Ground Truth, Timelens, CBMNet, WAN2.1 FLF2V, Ours

Baseline comparisons with VDM-EVFI

×12 interpolation

Panels: Input pairs, Ground Truth, VDM-EVFI, Ours