By treating a reference video that carries the desired semantics as a video prompt, and by enabling plug-and-play in-context generation through a mixture-of-transformers architecture, we can generate videos that are semantically consistent with the reference video.
Our Video-as-Prompt model supports various downstream applications:
(1) Different reference videos (different semantics) → same reference image: generate a video aligned with the semantics of each reference video;
(2) Different reference videos (same semantics) → same reference image: consistently generate videos aligned with the shared semantics;
(3) Same reference video → different reference images: transfer the same semantics (concept/style/motion/camera) to each reference image;
(4) Same reference video and image + user-modified prompt: preserve semantics and identity while using the prompt to adjust fine-grained attributes.
Given reference videos with different semantics and a single reference image, our model can generate a video aligned with the semantics of each reference video.
Given different reference videos that share the same semantics and a single reference image, our model can consistently generate videos aligned with those shared semantics.
Given a reference video, our model can generate new videos from different reference images, each semantically consistent with the reference video.
Given a reference video and a reference image, our model can preserve semantics and identity while using the prompt to adjust fine-grained attributes.
Given reference videos with unseen semantics, our model can generate videos that are semantically consistent with the reference videos in a zero-shot manner.