Abstract: Joint text-video models trained on state-of-the-art text-video datasets perform well on general text-video benchmarks that contain general actions. However, domain-specific datasets are hard to collect because data collection and labeling are costly. In this work, we propose a pipeline that automatically extends YouCook2, a dataset collected to describe and retrieve cooking videos. Related videos are found using 89 recipe categories, segments are then prepared by random sampling, and segment-text pairs are obtained with a captioning model. Experiments on the extended dataset show that the automatically created data improves recall. Code and data are available at https://github.com/EmreOzkose/hucook
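The random-sampling step of the pipeline can be illustrated with a minimal sketch. The abstract does not state how many segments are drawn per video or how long they are, so the function name `sample_segments` and the parameters `num_segments`, `min_len`, and `max_len` are hypothetical and are not taken from the released code.

```python
import random


def sample_segments(video_duration, num_segments, min_len=5.0, max_len=30.0, seed=None):
    """Draw random (start, end) segments, in seconds, inside one video.

    All parameter names and default values are illustrative assumptions;
    the abstract only says that segments are prepared by random sampling.
    """
    rng = random.Random(seed)
    segments = []
    for _ in range(num_segments):
        # Cap the segment length by the video duration so short videos still work.
        length = min(rng.uniform(min_len, max_len), video_duration)
        # Pick a start time that keeps the whole segment inside the video.
        start = rng.uniform(0.0, video_duration - length)
        segments.append((start, start + length))
    # Segments are sampled independently, so they may overlap.
    return sorted(segments)


if __name__ == "__main__":
    for start, end in sample_segments(video_duration=120.0, num_segments=4, seed=0):
        print(f"{start:.1f}s - {end:.1f}s")
```

In a pipeline of this kind, each sampled segment would then be passed to a captioning model to obtain a segment-text pair; that step is omitted here because the abstract does not name the model.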