# Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

## Setup

Set up your Azure OpenAI API configuration in the files listed below. You can find your keys in the Azure Portal. We recommend using [python-dotenv](https://github.com/theskumar/python-dotenv) to store and load your keys.
- `DSG/openai_utils.py`
- `DSG/dsg_questions_gen.py`
- `DSG/query_utils.py`
- `DSG/vqa_utils.py`

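```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="<your-endpoint>",    # your Azure OpenAI endpoint
    api_key="<your-api-key>",            # your API key
    api_version="<your-api-version>",    # your API version
)
```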
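
If you store the keys in a `.env` file as recommended above, a small helper along these lines could collect them before building the client. This is only a sketch: the function name `load_azure_config` and the three environment variable names are assumptions for illustration and are not defined by this repository, so rename them to match your own `.env` file.

```python
import os

# Assumed variable names (not defined by this repo); match them to your .env file.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "OPENAI_API_VERSION")


def load_azure_config(env=None):
    """Return keyword arguments for AzureOpenAI, read from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a clear message if any variable is unset or empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "api_version": env["OPENAI_API_VERSION"],
    }


# Typical use (assumes python-dotenv and openai are installed):
#   from dotenv import load_dotenv
#   from openai import AzureOpenAI
#   load_dotenv()  # copies entries from a local .env file into os.environ
#   client = AzureOpenAI(**load_azure_config())
```

Keeping the lookup in one function means the four files listed above can share the same configuration instead of each hard-coding the keys.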


## 1. Generate evaluation questions

Generate DSG$^{\text{obj}}$-based evaluation questions from the input text prompts.

```bash
python DSG/dsg_questions_gen.py
```

## 2. Refine video based on the evaluation results

```bash
output_root="./results/multi_k/k10/vc2/"      # output path
eval_sections=("count")                       # evaluation dimension(s), e.g., count

# Flags passed to main_iter.py:
#   --model            T2V model backbone
#   --selection_score  video ranking metric
#   --seed             random seed
#   --round            iteration round
#   --k                number of video candidates
#   --div_seeds        use a different seed in each iterative round
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
        --output_root="$output_root" \
        --eval_section="$section" \
        --model='t2vturbo' \
        --load_molmo \
        --selection_score='dsg_blip' \
        --seed=123 \
        --round=1 \
        --k=10 \
        --div_seeds

    # Convert the results to the EvalCrafter evaluation format
    CUDA_VISIBLE_DEVICES=1 python collect_best.py --output_root="$output_root" --eval_section="$section" --data='evalcrafter' --round=1
done
```
