Keywords: alignment, text-to-video generation, large language model, large vision model
Abstract: To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the *SafeSora* dataset to promote research on aligning text-to-video generation with human values. The dataset captures human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To elicit in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, which serve as the basis for pilot annotations. The *SafeSora* dataset comprises 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of human-labeled preference annotations. We further demonstrate the utility of the dataset through several applications, including training a text–video moderation model and aligning LVMs with human preferences by fine-tuning a prompt augmentation module or the diffusion model itself. These applications highlight its potential as a foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms. Our project is available at https://sites.google.com/view/safe-sora.
Warning: this paper contains example data that may be offensive or harmful.
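To make the two-dimensional annotation structure described in the abstract concrete, here is a minimal sketch of what a single preference record in such a dataset might look like. All field names and values below are hypothetical illustrations; the abstract does not specify the actual *SafeSora* schema, only that each record pairs two generated videos for a prompt and carries separate helpfulness and harmlessness preference labels.

```python
# Hypothetical sketch of one pairwise preference record, assuming a
# dataclass-style layout. Field names are illustrative, not the real schema.
from dataclasses import dataclass, field


@dataclass
class PreferenceRecord:
    prompt: str                 # one of the 14,711 unique text prompts
    video_a: str                # path/ID of the first generated video
    video_b: str                # path/ID of the second generated video
    generator_a: str            # which of the 4 LVMs produced video_a
    generator_b: str            # which of the 4 LVMs produced video_b
    # Helpfulness preference, optionally broken down into the 4 sub-dimensions.
    helpfulness_preferred: str = "a"            # "a" or "b"
    helpfulness_subdims: dict = field(default_factory=dict)
    # Harmlessness preference, with any of the 12 harm sub-categories flagged.
    harmlessness_preferred: str = "a"           # "a" or "b"
    harm_categories: list = field(default_factory=list)


# Illustrative example with made-up values:
record = PreferenceRecord(
    prompt="A city street at night in the rain",
    video_a="videos/0001_a.mp4",
    video_b="videos/0001_b.mp4",
    generator_a="lvm_1",
    generator_b="lvm_2",
    helpfulness_preferred="a",
    harmlessness_preferred="b",
)
```

Keeping the two preference labels as separate fields, as sketched above, reflects the paper's framing that a video pair may be ranked differently on helpfulness than on harmlessness.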
Submission Number: 1153