# Training-free Enhancement in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Step 1: Background Generation

1.1 Background image generation with T2I model
```bash
CUDA_VISIBLE_DEVICES=0 python bg_gen.py
```


1.2 Background animation with I2V model 
```bash
CUDA_VISIBLE_DEVICES=0 python bg_gen_video.py
```

Before running these scripts, update the path to the LLM plans for the background description.
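
The two background commands above must run in order, since the I2V step animates the image produced by the T2I step. A minimal sketch of a wrapper that enforces this and supports a dry run is shown below; `run`, `run_background_steps`, and the `DRY_RUN` variable are hypothetical names introduced here and are not part of the repository.

```bash
# Hypothetical helper: print a command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Run step 1.1, then step 1.2 only if 1.1 succeeded.
run_background_steps() {
  run env CUDA_VISIBLE_DEVICES=0 python bg_gen.py &&
  run env CUDA_VISIBLE_DEVICES=0 python bg_gen_video.py
}
```

Calling `DRY_RUN=1 run_background_steps` only prints the two commands, which is a cheap way to check the order before launching the GPU jobs.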


Step 2: Foreground object layout and trajectory planning

2.1 RAM + SAM for background object detection

```bash
CUDA_VISIBLE_DEVICES=0 python automatic_label_ram_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --ram_checkpoint ram_swin_large_14m.pth \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image background_images_flux \
  --output_dir background_images_flux \
  --box_threshold 0.25 \
  --text_threshold 0.2 \
  --iou_threshold 0.5 \
  --device "cuda"
```
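
The command above expects three checkpoint files. As a convenience, the following sketch checks that they are present before launching detection; `check_checkpoints` is a hypothetical helper, and it assumes the checkpoints sit in a single directory (the current directory by default), which matches the relative paths in the command but may differ in your setup.

```bash
# Hypothetical helper: report which checkpoints used in step 2.1 are missing.
# Usage: check_checkpoints [directory]   (defaults to the current directory)
# Returns 0 if all three files exist, otherwise the number of missing files.
check_checkpoints() {
  ckpt_dir="${1:-.}"
  ckpt_missing=0
  for f in ram_swin_large_14m.pth groundingdino_swint_ogc.pth sam_vit_h_4b8939.pth; do
    if [ ! -f "$ckpt_dir/$f" ]; then
      echo "missing: $ckpt_dir/$f" >&2
      ckpt_missing=$((ckpt_missing + 1))
    fi
  done
  return "$ckpt_missing"
}
```

Running `check_checkpoints .` from the directory where you launch `automatic_label_ram_demo.py` lists any missing file on stderr and returns a non-zero status, so it can gate the detection command with `&&`.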


2.2 Video sketch generation

```bash
CUDA_VISIBLE_DEVICES=0 python prior_gen_w_video.py
```

Before running the script, update the path to the LLM plans for the foreground objects.


Step 3: Structured Noise Inversion for video generation


```bash
CUDA_VISIBLE_DEVICES=0 python video_control_t2v.py --latent_path xxx --output_path xxx 
```

Set `--latent_path` to the video sketch generated in the previous steps, and `--output_path` to the desired output location.



Qualitative Examples:

We also provide the videos shown in the paper under `baseline_videos` and `ours_videos`.