AbsText2Video: Embracing Abstract Annotations to Caption Video Dataset

Published: 13 Dec 2024, Last Modified: 19 Feb 2025 · Good-Data · CC BY 4.0
Student Lead Author Indication: Yes
Keywords: Dataset, Text-to-Video Dataset, Video Generation, LLM
TL;DR: A video dataset with abstract text annotations.
Abstract: While text-to-video (T2V) generation methods have achieved astonishing success thanks to advances in large-scale T2V datasets, they suffer a sharp performance drop when given abstract descriptions. This is due in part to the lack of abstract text-to-video pairs in existing training data, and in part to the ill-posed nature of abstract text: many concrete texts can correspond to the same abstract text. More importantly, abstract language accounts for a large proportion (over 70%) of our daily communication. To address this issue, we propose an LLM-based abstract text annotation pipeline that dynamically updates its prompts based on generation quality. In addition, we propose a cycle similarity metric to measure the similarity between concrete and abstract text pairs. Finally, we introduce the new AbsText2Video dataset to extend video generation to a broader range of applications. Experiments on 11 T2V models verify the effectiveness of our dataset in handling abstract texts.
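The abstract names two mechanisms without detailing them, so the sketch below shows one plausible shape they could take. It is a minimal, hedged illustration, not the paper's released implementation: the `call_llm` wrapper, the prompt wording, the 0.6 threshold, and the prompt-feedback rule are all assumptions, and the cycle similarity here is read as "abstract caption → LLM-reconstructed concrete caption → embedding similarity to the original concrete caption," using off-the-shelf sentence embeddings.

```python
# Hedged sketch of an abstract-annotation loop with a cycle similarity check.
# Every name and constant below is an assumption for illustration, not the
# authors' code.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-LLM API; replace with a real client."""
    raise NotImplementedError


def cycle_similarity(concrete: str, abstract: str) -> float:
    """One plausible reading of the cycle similarity metric: map the abstract
    caption back to a concrete description, then compare the round-tripped
    text to the original concrete caption in embedding space."""
    reconstructed = call_llm(
        "Rewrite this abstract description as a concrete, visually "
        f"detailed video caption:\n{abstract}"
    )
    emb = embedder.encode([concrete, reconstructed], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def annotate(concrete_caption: str, threshold: float = 0.6, max_rounds: int = 3) -> str:
    """Sketch of the dynamic prompt-update loop: regenerate with an amended
    prompt whenever a candidate's cycle similarity falls below the (assumed)
    quality threshold."""
    prompt = (
        "Summarize this video caption as one abstract sentence:\n"
        f"{concrete_caption}"
    )
    abstract = ""
    for _ in range(max_rounds):
        abstract = call_llm(prompt)
        if cycle_similarity(concrete_caption, abstract) >= threshold:
            return abstract
        # Feedback step (assumed form): fold the rejected candidate back into
        # the prompt so the next generation avoids the same failure mode.
        prompt += f"\nAvoid vague rewrites like: {abstract}"
    return abstract
```

Under this reading, a higher cycle similarity means the abstract caption preserved enough of the original semantics to be reconstructed, which is why it can double as the quality signal driving the prompt updates.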
Submission Number: 24