Alignment of Image-Text and Video-Text Datasets

Published: 01 Jan 2023, Last Modified: 21 Oct 2024, SIU 2023, CC BY-SA 4.0
Abstract: This study addresses the alignment of video-text and image-text datasets. First, similarities are calculated over the texts in the two datasets. A retrieval setup based on visual similarities is then applied to the subset created from the calculated text similarities. A BERT-based embedding vector method is applied to both the raw and the pure texts. As visual features, object-based and CLIP-based methods are used to represent video frames. The results show that alignment with CLIP features performs best on the subset created by filtering with raw text.
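The two-stage procedure described in the abstract (text-based filtering followed by visual retrieval) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the text embeddings (e.g. from BERT) and visual embeddings (e.g. from CLIP) have already been computed as NumPy arrays, and the function names and the `text_threshold` parameter are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def align(video_text, image_text, video_visual, image_visual, text_threshold=0.5):
    """Return, for each video, the index of its aligned image (-1 if none).

    Assumption: each argument is a 2-D array of precomputed embeddings,
    one row per sample. The threshold-based filter is a hypothetical
    stand-in for the subset construction mentioned in the abstract.
    """
    text_sim = cosine_similarity_matrix(video_text, image_text)
    visual_sim = cosine_similarity_matrix(video_visual, image_visual)

    # Stage 1: keep only pairs whose text similarity passes the threshold.
    masked = np.where(text_sim >= text_threshold, visual_sim, -np.inf)

    # Stage 2: within the remaining subset, retrieve the most visually similar image.
    best = masked.argmax(axis=1)
    has_candidate = np.isfinite(masked.max(axis=1))
    return np.where(has_candidate, best, -1)
```

In this sketch, a video with no image above the text threshold is reported as `-1` rather than being forced onto a visually similar but textually unrelated image.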