Keywords: Streaming video translation, diffusion models, feature banks
TL;DR: We present StreamV2V to support real-time video-to-video translation for streaming input.
Abstract: This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts.
Unlike prior V2V methods that process a limited number of frames in batches, we process frames in a streaming fashion to support unlimited frames.
At the heart of StreamV2V lies a backward-looking principle that relates the present to the past.
This is realized by maintaining a feature bank, which archives information from past frames.
For incoming frames, StreamV2V extends self-attention to include banked keys and values, and directly fuses similar past features into the output.
The feature bank is continually updated by merging stored and new features, keeping it compact yet informative (a minimal sketch of this mechanism follows the abstract).
StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning.
It runs at 20 FPS on a single A100 GPU, making it 15$\times$, 46$\times$, 108$\times$, and 158$\times$ faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively.
Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
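The sketch below illustrates, under stated assumptions, the backward-looking mechanism the abstract describes: self-attention extended with banked keys/values from past frames, plus a bank update that merges similar features to stay compact. The class and function names (`FeatureBank`, `extended_self_attention`, `merge_update`), the cosine-similarity merging rule, and the token budget are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of feature-bank-extended self-attention; names and the
# merging heuristic are assumptions for illustration only.
import torch
import torch.nn.functional as F


class FeatureBank:
    """Stores keys/values from past frames and merges in new ones to stay compact."""

    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens
        self.keys = None    # (N_bank, d)
        self.values = None  # (N_bank, d)

    def extend_kv(self, k: torch.Tensor, v: torch.Tensor):
        """Concatenate banked keys/values with the current frame's for attention."""
        if self.keys is None:
            return k, v
        return torch.cat([self.keys, k], dim=0), torch.cat([self.values, v], dim=0)

    def merge_update(self, k: torch.Tensor, v: torch.Tensor, sim_thresh: float = 0.9):
        """Merge new features into the bank: average highly similar entries,
        append the rest, then truncate to the token budget (assumed policy)."""
        if self.keys is None:
            self.keys, self.values = k.clone(), v.clone()
            return
        sim = F.normalize(k, dim=-1) @ F.normalize(self.keys, dim=-1).T  # (N_new, N_bank)
        best_sim, best_idx = sim.max(dim=-1)
        dup = best_sim > sim_thresh
        # Average near-duplicate features into their closest banked entry.
        self.keys[best_idx[dup]] = 0.5 * (self.keys[best_idx[dup]] + k[dup])
        self.values[best_idx[dup]] = 0.5 * (self.values[best_idx[dup]] + v[dup])
        # Append genuinely new features, keeping only the most recent tokens.
        self.keys = torch.cat([self.keys, k[~dup]], dim=0)[-self.max_tokens:]
        self.values = torch.cat([self.values, v[~dup]], dim=0)[-self.max_tokens:]


def extended_self_attention(q, k, v, bank: FeatureBank):
    """Self-attention where the current frame also attends to banked keys/values."""
    k_ext, v_ext = bank.extend_kv(k, v)
    out = F.scaled_dot_product_attention(
        q.unsqueeze(0), k_ext.unsqueeze(0), v_ext.unsqueeze(0)
    ).squeeze(0)
    bank.merge_update(k.detach(), v.detach())
    return out
```

In this reading, the bank grows only where the new frame contains features unlike anything stored, which is one way a fixed token budget could remain informative across unlimited streaming frames.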
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6311