# Pipeline

This section describes how to use our code to construct a benchmark for video generation models. We begin with a brief overview of the whole process, then discuss each step in detail.

## Brief

The whole process consists of four steps:

1. **Register scenarios and generate examples**: Put the required JSON files in the `config` folder in the required format, then run a Python command.
2. **Generate videos**: Generate videos from the prompts in the `prompts` folder, put them in the `videos` folder in the required layout, then run two Python commands.
3. **Ask videos and observe variables**: Put the answers for each video in the same folder as the video, then run a Python command.
4. **Evaluate**: Run two Python commands to get the evaluation results.

## Register scenarios and generate examples

In this step, we need JSON files describing the causal systems. Each causal system requires two `.json` files with the same file name. The first file, whose prompts specify only $X$, should be placed in the `config/samples` folder. The second, whose prompts specify both $X$ and $Y$, should be placed in the `config/samples_with_results` folder. Each JSON file should have the keys `scenario`, `roots`, `non_roots`, `rules`, and `compositions`. The value of `compositions` should be a list of length $2^{m_1}$, where $m_1$ is the length of the `roots` list. Moreover, the `rules` should ensure that, for each $Y_j$, some values of $X$ give $Y_j=1$ and some give $Y_j=0$.
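
The structural requirements above can be checked before running the pipeline. The following is a minimal sketch, not part of the project code: the names `check_config` and `check_config_file` are hypothetical, and only the key names and the $2^{m_1}$ length requirement are taken from the description above. It does not verify the `rules` condition, because the exact schema of `rules` is not assumed here.

```python
import json
from pathlib import Path

# Keys that each config JSON file is expected to contain.
REQUIRED_KEYS = ("scenario", "roots", "non_roots", "rules", "compositions")


def check_config(cfg: dict) -> list[str]:
    """Return a list of problems found in one parsed config dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    if problems:
        return problems
    # `compositions` should have one entry per assignment of the root variables.
    expected = 2 ** len(cfg["roots"])
    actual = len(cfg["compositions"])
    if actual != expected:
        problems.append(f"compositions has {actual} entries, expected {expected}")
    return problems


def check_config_file(path: str) -> list[str]:
    """Load a JSON config from disk and check it (hypothetical helper)."""
    return check_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

An empty list means that the file passes these two checks, for example `check_config_file("config/samples/0_0.json")`.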

For example, with the files `0_0.json` and `0_1.json`, the file tree looks like:

```text
visual_causal/
├── README.md
├── ...
└── config/
    ├── samples/
    │   ├── 0_0.json
    │   └── 0_1.json
    └── samples_with_results/
        ├── 0_0.json
        └── 0_1.json
```

Then run the following command:

```bash
python generate_samples.py
```

This command modifies a series of files stored in the `database` folder: it registers all scenarios and stores their indices in `database/scenarios.csv`. It also generates samples for each scenario, stored in `database/sample/{scenario_id}`. For a detailed description of each file, see the [Files](#files) section. Finally, it generates prompts for all samples, stored in `prompts/prompts_{scenario_id}.txt`.

## Generate videos

The videos are generated by an external process that is not included in this project. After the previous step, all prompts are saved in the `prompts` folder, named `prompts_1.txt`, `prompts_2.txt`, ..., `prompts_{n}.txt`, where the corresponding scenario ids are $1,2,\dots,n$. Use these files to generate videos with the video generation models. After generating videos with all video generation models, save them in the following structure (suppose there are two video generation models, named `vgm1` and `vgm2`, $n$ scenarios, and two videos for each scenario):

```text
visual_causal/
├── README.md
├── ...
└── videos/
    ├── vgm1/
    |   ├── 1/
    |   |   ├── 0.mp4
    |   |   └── 1.mp4
    |   ├── 2/
    |   |   ├── 0.mp4
    |   |   └── 1.mp4
    |   ├── ...
    |   └── {n}/
    |       ├── 0.mp4
    |       └── 1.mp4
    └── vgm2/
        ├── 1/
        |   ├── 0.mp4
        |   └── 1.mp4
        ├── 2/
        |   ├── 0.mp4
        |   └── 1.mp4
        ├── ...
        └── {n}/
            ├── 0.mp4
            └── 1.mp4
```
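
Before collecting the videos, it may help to check that the layout above is complete. The following is a minimal sketch under the assumptions of the example tree: `video_path` and `missing_videos` are hypothetical helper names, scenario ids run from 1 to `n_scenarios`, and videos are numbered from 0 within each scenario folder.

```python
from pathlib import Path


def video_path(root: str, vgm_name: str, scenario_id: int, index: int) -> Path:
    """Build the path `videos/{vgm_name}/{scenario_id}/{index}.mp4` under root."""
    return Path(root) / "videos" / vgm_name / str(scenario_id) / f"{index}.mp4"


def missing_videos(root, vgm_names, n_scenarios, videos_per_scenario):
    """List the expected video files that do not exist yet."""
    missing = []
    for vgm_name in vgm_names:
        for scenario_id in range(1, n_scenarios + 1):
            for index in range(videos_per_scenario):
                path = video_path(root, vgm_name, scenario_id, index)
                if not path.is_file():
                    missing.append(path)
    return missing
```

For the example tree, `missing_videos("visual_causal", ["vgm1", "vgm2"], n, 2)` should return an empty list once all videos are in place.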

After that, run the following commands to collect the videos into the database:

```bash
python handle_new_videos.py
python copy_samples.py
```

The first command constructs folders named after the video generation models (e.g. `vgm1` and `vgm2`) in the `database` folder. It then classifies the videos by the values of their factors and by the index of the selected prompt among all possible prompts.

This command also moves the original videos into the `used_videos` folder. If they are not needed for anything else, you can delete them to save disk space.

The second command copies the generated samples into the folder of each video generation model (e.g. from the `sample` folder into the `vgm1` folder). The columns for observed variables in each sample table are left blank, to be filled in a later step.

## Ask videos and observe variables

In this step, we need to retrieve answers for each video before continuing. See [answer_retrieve/Readme.md](answer_retrieve/Readme.md) for a detailed description of how to retrieve answers for each video.

The answer-retrieval step must accomplish one task: recursively visit every file and folder in the `database` folder and, for every `.mp4` file encountered, generate a `.json` file with the same name in the same folder. The JSON file should contain the values of all variables in the current scenario.
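
The traversal described above can be sketched as follows. This is an illustration only, not the code in `answer_retrieve`: `write_answers` is a hypothetical name, and `answer_fn` is a hypothetical callback that returns a dictionary of variable values for one video. How the answers are actually obtained, and the exact value format, are outside this sketch.

```python
import json
from pathlib import Path
from typing import Callable


def write_answers(database_dir: str, answer_fn: Callable[[Path], dict]) -> int:
    """Write `<name>.json` next to every `<name>.mp4` under database_dir.

    `answer_fn` is a hypothetical callback returning {variable_name: value}
    for a single video. Returns the number of JSON files written.
    """
    written = 0
    # rglob visits all subfolders recursively.
    for video in sorted(Path(database_dir).rglob("*.mp4")):
        answers = answer_fn(video)
        video.with_suffix(".json").write_text(json.dumps(answers), encoding="utf-8")
        written += 1
    return written
```

Keeping the answer source behind a callback lets the same traversal be reused with any answering method.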

After retrieving answers for each video, run the following command:

```bash
python ask_videos.py --llm_names ""
```

This command reads all results and saves them in the `observed_{name}` columns of `all_samples.csv` and `sample_text_consistency.csv` for each scenario. The argument `--llm_names ""` means that the videos of all VGMs are processed. To process specific VGMs only, modify the default value of this argument in `ask_videos.py` and remove it from the command.

## Evaluate

First, run the following command to evaluate each VGM (video generation model):

```bash
python evaluate_models.py --llm_names ""
```

The results are saved in `database/{VGM_name}/results_{VGM_name}.csv` for each VGM.

Then run the following command to summarize all results:

```bash
python script/output_results.py --llm_names ""
```

The results are saved in `database/final_res.json` as JSON. The keys are the names of the VGMs, and the values are dictionaries containing the mean and standard deviation of each metric.
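
To inspect the summary, the nested dictionary can be flattened into rows. The following is a minimal sketch: `flatten_results` is a hypothetical helper, and it deliberately makes no assumption about the key names used for the mean and standard deviation inside each metric entry.

```python
def flatten_results(results: dict) -> list[tuple]:
    """Flatten {vgm: {metric: {stat: value}}} into (vgm, metric, stat, value) rows."""
    rows = []
    for vgm_name, metrics in results.items():
        for metric_name, stats in metrics.items():
            if isinstance(stats, dict):
                for stat_name, value in stats.items():
                    rows.append((vgm_name, metric_name, stat_name, value))
            else:
                # Fall back to a single row if a metric is stored as a plain value.
                rows.append((vgm_name, metric_name, "", stats))
    return rows


# Usage (assuming the file exists):
#   import json
#   with open("database/final_res.json", encoding="utf-8") as f:
#       for row in flatten_results(json.load(f)):
#           print(*row, sep="\t")
```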

### Evaluation for threshold-based metrics for rule consistency

To evaluate the threshold-based metrics for rule consistency ($s_3^{\mathrm{truth,threshold}}, s_3^{\mathrm{observe,threshold}}$), first run the following command:

```bash
python script/evaluate_models_threshold.py --llm_names ""
```

This command saves the evaluation results in `database/{VGM_name}/results_threshold_{VGM_name}.csv` for every VGM. Then run the following command:

```bash
python script/output_results_threshold.py --llm_names ""
```

This command summarizes the threshold-based metrics and saves them in `database/final_res_threshold.json`.


# Files

This section explains the meaning of each file produced in the database during the whole procedure.

## Files in `database` folder

### scenarios.csv

This file is generated during the [Register scenarios and generate examples](#register-scenarios-and-generate-examples) step. It contains two columns: `scenario_id` and `scenario`. Each row corresponds to a causal system. In each row, `scenario_id` is an integer giving the unique serial number of the causal system, and `scenario` is a string in the format `{json_name}%{description}`. Here `json_name` is the name of the corresponding JSON file in the `config/samples` and `config/samples_with_results` folders, i.e. the file named `{json_name}.json`. These two JSON files contain all information about the causal system, including the scenario, roots, non-roots, rules, and compositions of prompts. The `description` is a sentence describing the scenario.

For example, suppose a row in this file has `scenario_id=1` and `scenario="0_0%A small ball impacts the ground."`. Then the files for this causal system are stored in the folder `database/{VGM_name}/1` for each VGM, its basic information is stored in `config/samples/0_0.json` and `config/samples_with_results/0_0.json`, and its description is `"A small ball impacts the ground."`.
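
Splitting the `scenario` string back into its two parts can be sketched as follows. `parse_scenario` is a hypothetical helper; it splits on the first `%` only, on the assumption that `json_name` itself never contains `%`.

```python
def parse_scenario(scenario: str) -> tuple[str, str]:
    """Split '{json_name}%{description}' into (json_name, description)."""
    json_name, separator, description = scenario.partition("%")
    if not separator:
        raise ValueError(f"no '%' separator in scenario string: {scenario!r}")
    return json_name, description


json_name, description = parse_scenario("0_0%A small ball impacts the ground.")
# json_name is "0_0"; the config files are then config/samples/0_0.json and
# config/samples_with_results/0_0.json.
```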

### final_res.json

This file is generated during the [Evaluate](#evaluate) step and contains the summary results for the metrics. It is a dictionary with the name of each VGM as keys. Each value is also a dictionary with metrics as keys. The metric names and their corresponding notation in the paper are:

- `metric_1_all_ignore`: $s_1^{\mathrm{all}}$
- `metric_1_roots_ignore`: $s_1^{\mathrm{roots}}$
- `metric_2_truth`: $s_2^{\mathrm{truth}}$
- `metric_2_observe`: $s_2^{\mathrm{observe}}$
- `metric_3_truth`: $s_3^{\mathrm{truth}}$
- `metric_3_observe`: $s_3^{\mathrm{observe}}$
- `nan_ratio`: ratio of samples containing N/A answers in levels 2 and 3 (samples in `database/{VGM_name}/{scenario_id}/all_samples.csv`)
- `level_1`: ratio of N/A answers and ratio of wrong answers in level 1 (samples in `database/{VGM_name}/{scenario_id}/sample_text_consistency.csv`)
- `metric_1_all_fault` and `metric_1_roots_fault` are deprecated.

For each metric, the value is a dictionary containing the mean and standard deviation of the metric.

### final_res_threshold.json

This file is generated during the [Evaluation for threshold-based metrics for rule consistency](#evaluation-for-threshold-based-metrics-for-rule-consistency) step and contains the summary results for the threshold-based metrics. It is a dictionary with the name of each VGM as keys. Each value is also a dictionary with metrics as keys. The metric names and their corresponding notation in the paper are:

- `metric_3_truth_0.65`: $s_3^{\mathrm{truth,threshold}}$ with threshold $t=0.65$;
- `metric_3_observe_0.65`: $s_3^{\mathrm{observe,threshold}}$ with threshold $t=0.65$;
- `metric_3_truth_0.75`: $s_3^{\mathrm{truth,threshold}}$ with threshold $t=0.75$;
- `metric_3_observe_0.75`: $s_3^{\mathrm{observe,threshold}}$ with threshold $t=0.75$;
- `metric_3_truth_0.85`: $s_3^{\mathrm{truth,threshold}}$ with threshold $t=0.85$;
- `metric_3_observe_0.85`: $s_3^{\mathrm{observe,threshold}}$ with threshold $t=0.85$;
- `metric_3_truth_0.95`: $s_3^{\mathrm{truth,threshold}}$ with threshold $t=0.95$;
- `metric_3_observe_0.95`: $s_3^{\mathrm{observe,threshold}}$ with threshold $t=0.95$;

For each metric, the value is a dictionary containing the mean and standard deviation of the metric.
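
The key names listed above follow a regular pattern, so they can be generated instead of typed by hand. The following is a small sketch: `threshold_metric_names` is a hypothetical helper, and the thresholds are the four values listed above.

```python
# Thresholds listed in this section.
THRESHOLDS = (0.65, 0.75, 0.85, 0.95)


def threshold_metric_names(thresholds=THRESHOLDS) -> list[str]:
    """Build the `metric_3_{truth|observe}_{t}` key names for each threshold."""
    return [
        f"metric_3_{kind}_{t}"
        for t in thresholds
        for kind in ("truth", "observe")
    ]
```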


## Files for each VGM

The files are stored in the `database/{VGM_name}` folder. For example, if the VGM is named `Pika`, the files are stored in `database/Pika`. The file names depend on the name of the VGM, represented by `{VGM_name}`.

### results_{VGM_name}.csv

This file is generated during the [Evaluate](#evaluate) step and contains the evaluation results of the metrics for each causal system. The first column is `scenario_id`, which identifies the causal system. The mapping between `scenario_id` and causal systems is stored in `database/scenarios.csv`, and the files for each causal system are stored in the `database/{VGM_name}/{scenario_id}` folder. The other columns give the evaluation result of each metric for the corresponding causal system. The column names have the same meaning as in [final_res.json](#final_resjson).

### results_threshold_{VGM_name}.csv

This file is generated during the [Evaluation for threshold-based metrics for rule consistency](#evaluation-for-threshold-based-metrics-for-rule-consistency) step and contains the evaluation results of the threshold-based metrics for each causal system. The first column is `scenario_id`, which identifies the causal system. The mapping between `scenario_id` and causal systems is stored in `database/scenarios.csv`, and the files for each causal system are stored in the `database/{VGM_name}/{scenario_id}` folder. The other columns give the evaluation result of each metric for the corresponding causal system. The column names have the same meaning as in [final_res_threshold.json](#final_res_thresholdjson).


## Files for each causal system

The files are stored in the `database/{VGM_name}/{scenario_id}` folder. For example, for the VGM named `Pika` and `scenario_id=3`, the files are stored in `database/Pika/3`.

### all_samples.csv

This file contains all samples whose prompts are generated by specifying only $\mathbf{X}$. Let $\mathbf{V}$ denote the set of all variables in this causal system, and let $m=|\mathbf{V}|$. This file has $2m+2$ columns: `sample_id`, `prompt`, and, for the `{name}` of each variable $V\in \mathbf{V}$, `true_{name}` and `observed_{name}`. Each row represents a sample. The columns are:

- `sample_id`: the unique identifier of each sample. It is used to register samples for level 2 (generation consistency, [sample_index_level_2.csv](#sample_index_level_2csv)) and level 3 (rule consistency, [sample_{non_root_name}.csv](#sample_non_root_namecsv)).
- `prompt`: a string containing the generated prompt for this sample. The prompt specifies the value of all root variables $X\in \mathbf{X}$ as `true_{name}` for the `{name}` of each root variable.
- `true_{name}`: the true value of each variable $V\in\mathbf{V}$ for this sample.
- `observed_{name}`: the observed value of each variable in the generated video. These columns are filled in during the [Ask videos and observe variables](#ask-videos-and-observe-variables) step. Empty cells correspond to N/A results from the answer-retrieval step.
- `metric2` and `metric3` (if present): deprecated.
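
As a sanity check on the $2m+2$ column layout, the expected column names can be derived from the variable names. The following is a minimal sketch with hypothetical helper names. It compares sets rather than positions, because the column order is not assumed here, and it tolerates extra columns such as the deprecated ones.

```python
def expected_columns(variable_names) -> set[str]:
    """Return the 2m+2 column names expected for m variables."""
    columns = {"sample_id", "prompt"}
    for name in variable_names:
        columns.add(f"true_{name}")
        columns.add(f"observed_{name}")
    return columns


def missing_columns(header, variable_names) -> list[str]:
    """List expected columns that are absent from a CSV header."""
    return sorted(expected_columns(variable_names) - set(header))
```

The `header` argument can be obtained, for instance, from the first row returned by `csv.reader` on `all_samples.csv`.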


### basic_info.json

This file contains the basic information for the corresponding causal system. The keys and values of this file are:

- `roots`: a list of strings containing names of root variables $X\in \mathbf{X}$.
- `non_roots`: a list of strings containing names of non-root variables $Y\in \mathbf{Y}$.
- `rules`: a dictionary with the names of the non-root variables as keys. Each value is a list of dictionaries that together form a disjunctive normal form; the non-root variable should be True when this form is satisfied.
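
Evaluating such a disjunctive normal form can be sketched as follows. This rests on an assumption about the schema that this document does not spell out: each dictionary in the list is taken to be one conjunction that maps variable names to the Boolean values they must have, and the list is the disjunction of these conjunctions. The function and variable names are hypothetical.

```python
def eval_rule(clauses: list[dict], values: dict) -> bool:
    """Evaluate a DNF: True if every condition of at least one clause holds.

    Assumed schema: each clause maps a variable name to its required value.
    """
    return any(
        all(values[name] == required for name, required in clause.items())
        for clause in clauses
    )


# Hypothetical rule: `bounce` is True if the ball is dropped and elastic,
# or if it is thrown.
rules = {"bounce": [{"dropped": True, "elastic": True}, {"thrown": True}]}
roots = {"dropped": True, "elastic": False, "thrown": False}
bounce = eval_rule(rules["bounce"], roots)
```

Under this reading, an empty list of clauses is never satisfied, so the variable is always False.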

### evaluate_results.csv

This file contains the evaluation results for sample-based metrics. Let $\mathbf{V}$ denote the set of all variables in this causal system, and let $m=|\mathbf{V}|$. The first $2m+2$ columns of this file are the same as in [all_samples.csv](#all_samplescsv). The next 6 columns correspond to metrics in [final_res.json](#final_resjson).

### prompts_{scenario_id}.txt

This file contains all prompts for this causal system. It first lists all prompts in [all_samples.csv](#all_samplescsv) in order of `sample_id`, then all prompts in [sample_text_consistency.csv](#sample_text_consistencycsv) in order of `sample_id`. This file is generated, and also copied to `prompts/prompts_{scenario_id}.txt`, in the [Register scenarios and generate examples](#register-scenarios-and-generate-examples) step.

### save_paths_{scenario_id}.txt

Deprecated. Only used in the code for handling generated videos.

### sample_index_level_2.csv

This file lists the groups used to evaluate generation consistency and the `sample_id`s of the samples in each group. It contains two columns: `group_id` and `sample_id`. Each row states that the sample with id `sample_id`, defined in [all_samples.csv](#all_samplescsv), belongs to the group with id `group_id`.
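
Reading this file back into groups can be sketched as follows. `load_groups` is a hypothetical helper; it assumes that `sample_id` values are integers and keeps `group_id` as the string read from the file.

```python
import csv
import io
from collections import defaultdict


def load_groups(csv_file) -> dict:
    """Map each group_id to the list of sample_ids it contains."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["group_id"]].append(int(row["sample_id"]))
    return dict(groups)


# Illustration with in-memory data; for a real file, pass
# open(path, newline="", encoding="utf-8") instead.
example = io.StringIO("group_id,sample_id\n0,1\n0,2\n1,3\n")
groups = load_groups(example)
```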

### sample_text_consistency.csv

This file contains all samples whose prompts are generated by specifying both $\mathbf{X}$ and $\mathbf{Y}$. All columns in this file are the same as in [all_samples.csv](#all_samplescsv). The column `metric1` is deprecated.

### sample_{non_root_name}.csv

These files contain the samples used for rule consistency. For each non-root variable $Y\in \mathbf{Y}$ with name `{non_root_name}`, the samples are stored in `sample_{non_root_name}.csv`. Each file contains two columns: `False` and `True`. Each line is a list of integers that correspond to the `sample_id` column in [all_samples.csv](#all_samplescsv). The column `False` contains samples for which the true value of $Y$ is False, and the column `True` contains samples for which the true value of $Y$ is True.

### sample_results

This folder contains the evaluation results for sample-based metrics for each sample in [all_samples.csv](#all_samplescsv). The names of the JSON files correspond to the `sample_id` column in [all_samples.csv](#all_samplescsv). The values are the same as the last 6 columns in [evaluate_results.csv](#evaluate_resultscsv).

### videos

This folder contains all videos generated by the VGM in the [Generate videos](#generate-videos) step and the corresponding answers retrieved in the [Ask videos and observe variables](#ask-videos-and-observe-variables) step. It contains two folders: `rule` and `text`. The `rule` folder contains videos for level 2 (generation consistency) and level 3 (rule consistency), i.e. samples in [all_samples.csv](#all_samplescsv). The `text` folder contains videos for level 1 (text consistency), i.e. samples in [sample_text_consistency.csv](#sample_text_consistencycsv). Both folders are arranged in the following format:

Suppose the information for this causal system is contained in `config/samples/x_x.json`, `comps_rule` is the value of `compositions` in `config/samples/x_x.json`, and `comps_text` is the value of `compositions` in `config/samples_with_results/x_x.json`. Then both `comps_rule` and `comps_text` are lists of dictionaries. Each dictionary in these lists has the keys `factor_value`, `result_value`, and `samples`.

First, we convert each `factor_value` into a number string, where `False` corresponds to `0` and `True` corresponds to `1`. For example, if `factor_value=[false, true, true]`, then the corresponding number string is `011`. We use these number strings to name the folders in `videos/rule` and `videos/text`.

Then, for each dictionary in `comps_rule` and `comps_text`, its `samples` entry is a list of several prompts. Suppose there are $k$ prompts, numbered $0,1,\dots,k-1$. The next-level folders under `videos/rule` and `videos/text` are named with these numbers. Here are two examples. First, consider prompts that specify only $\mathbf{X}$, with `factor_value=[false, true, true]`, where the prompt is the fourth one in the `samples` entry and is therefore numbered $3$. The videos for this prompt are saved in `videos/rule/011/3`. Second, consider prompts that specify both $\mathbf{X}$ and $\mathbf{Y}$, with `factor_value=[true, true, false]`, where the prompt is the eighth one in the `samples` entry and is therefore numbered $7$. The videos for this prompt are saved in `videos/text/110/7`.
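
The folder naming described so far can be sketched as follows. `factor_code` and `prompt_folder` are hypothetical helper names; the paths are relative to the folder of the causal system.

```python
from pathlib import PurePosixPath


def factor_code(factor_value) -> str:
    """Convert a list of Booleans into a digit string, e.g. [False, True] -> '01'."""
    return "".join("1" if value else "0" for value in factor_value)


def prompt_folder(kind: str, factor_value, prompt_index: int) -> PurePosixPath:
    """Return `videos/{kind}/{factor_code}/{prompt_index}` for kind 'rule' or 'text'."""
    if kind not in ("rule", "text"):
        raise ValueError(f"kind must be 'rule' or 'text', got {kind!r}")
    return PurePosixPath("videos") / kind / factor_code(factor_value) / str(prompt_index)
```

With these helpers, `prompt_folder("rule", [False, True, True], 3)` gives `videos/rule/011/3` and `prompt_folder("text", [True, True, False], 7)` gives `videos/text/110/7`, matching the two examples.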

Here, `factor_value` gives the true values of $\mathbf{X}$. The order of the variables follows the `roots` entry in the JSON file, which is the same as the order of the `true_{name}` columns in [all_samples.csv](#all_samplescsv) and [sample_text_consistency.csv](#sample_text_consistencycsv).

Finally, multiple videos may be generated from the same prompt. They are named with integers in increasing order. For example, three videos with the same prompt, saved in the folder `videos/rule/10/0`, have the paths `videos/rule/10/0/0.mp4`, `videos/rule/10/0/1.mp4`, and `videos/rule/10/0/2.mp4`. To relate samples to videos, all samples with the same prompt correspond to all videos for that prompt, and a sample with a smaller `sample_id` corresponds to a video with a smaller name.
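
The matching rule between samples and videos can be sketched as follows. `pair_samples_with_videos` is a hypothetical helper, not the project's implementation. It sorts video names by their integer value, so that `10.mp4` comes after `2.mp4`, and it assumes one video per sample for a given prompt.

```python
def pair_samples_with_videos(sample_ids, video_names) -> list[tuple]:
    """Pair samples and videos of one prompt in increasing order.

    The sample with the smallest sample_id is paired with the video whose
    file name is the smallest integer, and so on.
    """
    ordered_samples = sorted(sample_ids)
    ordered_videos = sorted(video_names, key=lambda name: int(name.rsplit(".", 1)[0]))
    # zip stops at the shorter list if the counts differ.
    return list(zip(ordered_samples, ordered_videos))
```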
