Track: Long Paper Track (up to 9 pages)
Keywords: Video understanding, video-language foundation model, action recognition, multi-modal learning
TL;DR: We present a large-scale study evaluating current vision-language foundation models, focusing on their zero-shot transfer to video understanding tasks.
Abstract: Vision-language foundation models, including vision-language models (VLMs) and vision large language models (VLLMs), have evolved rapidly and shown strong performance on a range of downstream video understanding tasks, especially on web-sourced datasets. However, it remains an open question how well these VLMs and VLLMs perform in more challenging scenarios such as Activities of Daily Living (ADL). To answer this, we present a comprehensive study of VLMs and VLLMs, comparing their zero-shot transfer ability on five downstream tasks: action classification, video retrieval, video description, action forecasting, and frame-wise action segmentation. We conduct extensive experiments on eleven real-world, human-centric video understanding datasets (e.g., Toyota Smarthome, Penn Action, UAV-Human, EgoExo4D, TSU, Charades) and offer insights into the strengths and limitations of these models in zero-shot settings. Moreover, we provide an in-depth analysis to identify the best configuration for improving model performance on zero-shot action classification. Our experiments show that these models remain far from satisfactory performance across all evaluated tasks, particularly on densely labeled and long-video datasets.
Supplementary Material: pdf
Submission Number: 33