Video-As-Prompt: Unified Semantic Control for Video Generation

* This page contains many high-fidelity video demonstrations; we recommend waiting until all videos have completely loaded to ensure smooth and accurate visualization.
This offline project page is used for the ICLR 2026 submission. Due to the size limit on supplementary material, we provide only the applications section and the zero-shot section. Please visit the online project page for the full version: https://video-as-prompt.github.io/
Treating Reference Videos As In-Context Prompts

By treating a reference video that carries the desired semantics as a video prompt, and enabling plug-and-play in-context generation through a mixture-of-transformers architecture, we generate videos that are semantically consistent with the reference video.
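
The in-context idea can be pictured with a minimal sketch. It assumes the general mixture-of-transformers pattern, in which each token stream keeps its own projection weights while attention runs over the concatenated sequence. This is an illustrative assumption, not our released implementation; all names (`joint_attention`, `random_weights`, the token shapes) are hypothetical, and plain Python lists stand in for a deep learning framework.

```python
import math
import random


def linear(tokens, weight):
    """Multiply each token (a list of floats) by a dim x dim weight matrix."""
    return [[sum(t * w for t, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_weights(dim, rng):
    """Hypothetical per-branch Q/K/V projection weights."""
    return {name: [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
            for name in ("q", "k", "v")}


def joint_attention(ref_tokens, tgt_tokens, ref_w, tgt_w):
    """Attention over reference-video and target-video tokens together."""
    # Each branch projects its own tokens with its own parameters.
    q = linear(ref_tokens, ref_w["q"]) + linear(tgt_tokens, tgt_w["q"])
    k = linear(ref_tokens, ref_w["k"]) + linear(tgt_tokens, tgt_w["k"])
    v = linear(ref_tokens, ref_w["v"]) + linear(tgt_tokens, tgt_w["v"])
    # Attention runs over the concatenated sequence, so target tokens
    # can read the reference video in context.
    scale = math.sqrt(len(q[0]))
    attn = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
            for qi in q]
    out = [[sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
           for row in attn]
    # Split the result back into the two branches.
    n_ref = len(ref_tokens)
    return out[:n_ref], out[n_ref:], attn


rng = random.Random(0)
dim = 4
ref_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
tgt_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
ref_out, tgt_out, attn = joint_attention(
    ref_tokens, tgt_tokens, random_weights(dim, rng), random_weights(dim, rng))
print(len(ref_out), len(tgt_out), len(attn[0]))
```

In this toy setup, every row of `attn` spans all five tokens (three reference, two target), so changing the reference tokens changes the target output even though the target branch's weights stay fixed.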

Method overview
Applications

Our Video-As-Prompt model supports several downstream applications:
(1) Different reference videos (different semantics) → same reference image: generate a video aligned with the semantics of each reference video;
(2) Different reference videos (same semantics) → same reference image: consistently generate videos aligned with the shared semantics;
(3) Same reference video → different reference images: transfer the same semantics (concept/style/motion/camera) to each reference image;
(4) Same reference video and image + user-modified prompt: preserve semantics and identity while using the prompt to adjust fine-grained attributes.
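
The four settings differ only in which input is varied while the others are held fixed. The sketch below makes that explicit with a hypothetical request structure; `GenerationRequest`, `build_requests`, and all file names and prompts are illustrative assumptions and do not describe an actual interface of our model.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical bundle of the three inputs to one generation call."""
    reference_video: str  # video prompt that carries the semantics
    reference_image: str  # image that fixes the subject identity
    prompt: str           # text prompt for fine-grained attributes


def build_requests(videos, images, prompts):
    """Return one request per (video, image, prompt) combination."""
    return [GenerationRequest(v, i, p) for v, i, p in product(videos, images, prompts)]


# Settings (1) and (2): vary the reference video, fix the image and prompt.
multi_video = build_requests(["ref_a.mp4", "ref_b.mp4"], ["subject.png"], ["a toy character"])

# Setting (3): fix the reference video, vary the reference image.
multi_image = build_requests(["ref_a.mp4"], ["subject_1.png", "subject_2.png"], ["a toy character"])

# Setting (4): fix the video and image, vary only the prompt.
multi_prompt = build_requests(
    ["ref_a.mp4"], ["subject.png"],
    ["a toy character with black fur", "a toy character with red fur"])
```

Settings (1) and (2) share the same request shape; they differ only in whether the chosen reference videos depict different semantics or the same one.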

Different Reference Videos (Different Semantics) and Same Reference Image

Given reference videos with different semantics and a single reference image, our model generates a video aligned with the semantics of each reference video.

Different Reference Videos (Same Semantics) and Same Reference Image

Given different reference videos that share the same semantics and a single reference image, our model consistently generates videos aligned with those shared semantics.

Same Reference Video and Different Reference Images

Given a single reference video, our model generates a new video from each of several different reference images, and each generated video is semantically consistent with the reference video.

[Video panels: Reference Video, Generated Video 1, Generated Video 2, Generated Video 3]

Same Reference Video & Image and User-Modified Prompt

Given a reference video and a reference image, our model preserves semantics and identity while the text prompt adjusts fine-grained attributes.

... a Ladudu toy character with black fur...
... a Ladudu toy character with golden fur...
... a Ladudu toy character with green fur...
... a Ladudu toy character with purple fur...
... a Ladudu toy character with red fur...
... a Ladudu toy character with white fur...
Zero-Shot Semantic-Guided Generation

Given reference videos whose semantics were not seen during training, our model generates videos that are semantically consistent with those reference videos in a zero-shot manner.