# AD-VEGAS: Auto-Descriptive Video Editing and Generation with Multi-Granular Superpixel Aggregation

<p align="center">
<img src="./assets/ad-vegas-teaser.png" alt="teaser image"/>
</p>

Recent video generative models have made impressive progress in conditional generation and editing, but they typically depend on accurate, detailed text descriptions of the input videos. This requirement limits their practical application, since most web-scale raw videos lack such labor-intensive descriptions. This paper proposes AD-VEGAS, a versatile and user-friendly video-to-paragraph-to-video generative framework for zero-shot localized video editing and conditional generation.

AD-VEGAS consists of two principal steps: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P phase, we describe video scenes in natural language, capturing both holistic context and focused details through multi-granular superpixel-based video localization. In the P2V phase, the framework edits or creates videos, focusing on important regions or characteristics of the subjects according to users' intentions.

Our framework differs from other baselines in two main ways: (1) It automatically interprets raw videos in natural language, capturing both the broad context and intricate details. Users can therefore create or edit videos from our comprehensive descriptions while preserving the characteristics of the visual subjects, without needing sophisticated human annotations. (2) It enhances video generation and editing models by guiding them with localized video features and descriptions, supporting multi-scene video creation and localized editing.

AD-VEGAS achieves superior performance against strong baselines on video-to-paragraph generation (up to +38.4% in human evaluations), video generation (+36.9% in FVD), and video editing (+9.9% in CLIP-text). We provide extensive quantitative and qualitative analysis of AD-VEGAS, including the quality of the generated descriptions and videos. Our code is provided in the supplementary.



## Stage 1: Video-to-Paragraph Generation

For the YouCook2 dataset:
```
sh test_youcook.sh 0
```

For the ActivityNet dataset:
```
sh test_anet.sh 0
```


## Stage 2: Paragraph-to-Video for Conditional Video Generation and Video Editing

### Conditional Video Generation

Please prepare the environment following the instructions in the DynamiCrafter [repo](https://github.com/Doubiiu/DynamiCrafter), then run DynamiCrafter with the captions generated in Stage 1.

```
cd DynamiCrafter
sh scripts/run.sh 1024
```

### Video Editing

Please prepare the environment following the instructions in the TokenFlow [repo](https://github.com/omerbt/TokenFlow), then run TokenFlow with the captions generated in Stage 1.

```
cd TokenFlow
python preprocess.py --data_path <data/myvideo.mp4> \
                     --inversion_prompt <'' or a string describing the video content>
python run_tokenflow_pnp.py
```
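
To pass a generated paragraph as the inversion prompt, the caption has to be flattened into a single string. The following is a minimal sketch, not part of this repository: `build_preprocess_cmd` is a hypothetical helper, and it assumes the Stage 1 caption for a video has been saved to a plain-text file. It only prints the command, using the two flags shown above, so you can inspect it before running.

```shell
# Hypothetical helper (not part of this repo or of TokenFlow).
# Assumption: the generated paragraph for one video is stored in a text file.
# Prints the preprocess command instead of executing it.
build_preprocess_cmd() {
    video_path="$1"
    caption_file="$2"
    # Join lines with spaces, collapse repeated spaces, trim the trailing one.
    prompt=$(tr '\n' ' ' < "$caption_file" | sed 's/  */ /g; s/ *$//')
    printf 'python preprocess.py --data_path %s --inversion_prompt "%s"\n' \
        "$video_path" "$prompt"
}
```

For example, `build_preprocess_cmd data/myvideo.mp4 caption.txt` prints one command line. Captions containing double quotes would need escaping, which this sketch does not handle.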
