Here we showcase videos generated with wanx2.1 and VideoCrafter2 (VC2) to demonstrate the effect of our method.



## Comparison

In the comparison-wanx folder, each video shows, from left to right, the baseline SFT result, DPO, and our method. Our method exhibits more consistent tone and a more realistic visual representation, while also maintaining better alignment between text and video content. The prompt for each video is as follows:

> case1: A man picked up a water cup to drink.
> case2: A man dressed in a suit is adjusting his tie.
> case3: Two women sit before a table.
> case4: A young, fair-skinned woman with long blonde hair is standing outdoors on a sunny day, wearing sunglasses.
> case5: A woman is eating with chopsticks.
> case6: An older man with short black hair wearing a white shirt, dark vest, and patterned tie stands looking at another person off-screen.
> case7: A sad Asian man slowly wiped away tears with his hands.



In the comparison-vc2 folder, each video shows, from left to right, the baseline SFT result, DPO, and our method. The prompt for each video is as follows:

> 0004.mp4: A fair-skinned man with short brown hair is wearing glasses and a blue suit.
> 0017.mp4: An Asian man with short black hair, tan skin, and a goatee is speaking.
> 0020.mp4: A young woman with long brown hair. She is wearing a black jacket.
> 0058.mp4: A young woman is holding a heart-shaped balloon with a blue shirt.
> 0072.mp4: A young Black woman with long, dark hair. She wears a green strapless top.
> 0086.mp4: A young woman with long hair is shown in profile view from the shoulders up.
> 0093.mp4: A young woman with long blonde hair is putting on a white face mask.
>
> These videos were randomly sampled from the test dataset.



## Ablation

In the ablation folder, we show the effectiveness of each component. The videos, from left to right, are the baseline (VideoDPO), +IPR (the videos named win reward), +ARS (the videos named reweighted dpo), and our full method (PG-DPO). The prompt for each video is as follows:

> wanx_1: A man in a blue suit is giving a speech.
> wanx_2: A man dressed in a suit is adjusting his tie.
> wanx_3: An East Asian man with short, dark hair pulled back from his face looks downward. 
> wanx_4: An older man with short black hair wearing a white shirt, dark vest, and patterned tie stands looking at another person off-screen.
>
> vc_1: An Asian man with short black hair is upset. **He wipes tears with one hand.**
> vc_2: **A person stands in front,** with others behind, all dressed in white lab coats.
> vc_3: A young Black woman with long, dark hair. She wears a **green** strapless top.

Notably, all three vc cases show a marked improvement in text-video alignment: in vc_1, our method better follows ‘wipes tears with one hand’; in vc_2, it better follows ‘a person stands in front’; and in vc_3, it better follows ‘green strapless’.