Abstract: Highlights
• Two proposed tasks aim to predict relations between action classes.
• The ground-truth relations are provided by the Meta Video Dataset (MetaVD).
• Recent pre-trained models in NLP and CV are useful for these tasks.
• Action label texts yield higher predictive performance than videos.
• Using both action label texts and videos can improve performance.