Keywords: vision language models, video understanding, ethology, neuroscience
TL;DR: We develop a new framework that transforms existing datasets into multi-task code-based annotations. A state-of-the-art VLM fine-tuned on this data outperforms vision-only models and larger zero-shot VLMs.
Abstract: Animal behavior analysis is fundamental to ethology, behavioral ecology, and neuroscience.
Current methods mostly rely on vision-only classifiers, which are task-specific and confined to closed-vocabulary classification.
Vision-language models (VLMs) show strong video question-answering (VideoQA) performance across domains but remain underexplored for animal behavior.
We present a novel framework that converts existing datasets into a comprehensive multi-task VideoQA dataset with code-based solutions, without requiring any additional annotation.
Fine-tuning InternVL3-8B on this dataset yields improvements of up to 33.2 and 26.9 percentage points over supervised vision-only baselines and over zero-shot VLMs with 10× more parameters, respectively.
Our systematic evaluation demonstrates the advantages of vision-language approaches for animal behavior analysis and advances interpretable, code-based predictions that enhance scientific insight.
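To illustrate the kind of conversion the abstract describes, below is a minimal, purely hypothetical sketch of turning one existing closed-vocabulary behavior label into several code-answer VideoQA items. The field names, questions, and output schema are assumptions for illustration and are not the paper's actual data format.

```python
# Illustrative sketch only: deriving multi-task, code-based VideoQA items
# from an existing behavior annotation. All names and formats are hypothetical.
from dataclasses import dataclass


@dataclass
class ClipAnnotation:
    video_path: str   # path to the source clip
    behavior: str     # e.g. "grooming"
    start_s: float    # behavior onset (seconds)
    end_s: float      # behavior offset (seconds)


def to_videoqa_items(ann: ClipAnnotation) -> list[dict]:
    """Derive multiple QA tasks from one existing label, with code as the answer."""
    return [
        {   # task 1: behavior recognition
            "video": ann.video_path,
            "question": "Which behavior does the animal perform? Answer in code.",
            "answer": f'behavior = "{ann.behavior}"',
        },
        {   # task 2: temporal localization reusing the same annotation
            "video": ann.video_path,
            "question": "When does the behavior occur? Answer in code.",
            "answer": f"interval_s = ({ann.start_s}, {ann.end_s})",
        },
    ]


if __name__ == "__main__":
    items = to_videoqa_items(ClipAnnotation("clip_001.mp4", "grooming", 2.0, 5.5))
    for item in items:
        print(item["question"], "->", item["answer"])
```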
Submission Number: 26