Keywords: video-language models, domain-specific action recognition, large-scale video data, video benchmarks
TL;DR: Can your video model distinguish a "triple lutz jump" from a "triple flip jump" in figure skating? How about more than 1000 actions across 38 domains?
Abstract: Videos are unique in their ability to capture *actions* that span multiple frames. Accordingly, action recognition has long been a quintessential task for video models. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern video-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on **domain-specific** actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering **1087 distinct actions from 38 domains**. VLMs struggle immensely on VideoNet, with Gemini 2.5 Pro performing only $15.8$ percentage points better than random chance. To improve model performance, we provide in-context demonstrations but see only a $3$% boost in VLM performance compared to a $13$% increase in non-expert human accuracy, suggesting that VLMs are poor few-shot learners. Finally, we collect a large-scale training dataset containing nearly 500k video question-answer pairs. Fine-tuning an open-weight 4B model on our data, we surpass all Gemini models on the VideoNet benchmark. We release all of our data, inviting the community to explore new techniques to improve domain-specific action recognition capabilities and few-shot learning in video models.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27