Keywords: vision language models, video understanding, ethology, neuroscience
TL;DR: We develop a new framework that transforms existing datasets into multi-task code-based annotations. Fine-tuning a state-of-the-art VLM on this data outperforms vision-only models and larger zero-shot VLMs.
Abstract: Animal behavior analysis is fundamental to ethology, behavioral ecology, and neuroscience. Current methods typically use vision-only classifiers, which are task-specific and limited to closed-vocabulary classification paradigms. Vision-language models (VLMs) show strong video question-answering (VideoQA) performance across domains but remain underexplored for animal behavior. We present a novel framework that converts existing datasets into a comprehensive multi-task VideoQA dataset with code-based solutions, requiring no additional annotation. Fine-tuning InternVL3-8B on this dataset, we achieve up to 33.2 and 26.9 percentage-point improvements over supervised vision-only baselines and zero-shot VLMs with 10× more parameters, respectively. Our systematic evaluation demonstrates the superiority of vision-language approaches and advances interpretable, code-based predictions for behavioral analysis.
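To illustrate what a "code-based" VideoQA annotation could look like, the sketch below shows one hypothetical way an existing closed-vocabulary label (video clip + behavior class) might be rewrapped as a question whose target answer is a short program. The label set, field names, and `classify` call are assumptions for illustration only, not the authors' released format.

```python
# Minimal sketch (assumed format, not the authors' implementation):
# converting a classification-style annotation into a VideoQA item
# whose answer is expressed as executable code.
from dataclasses import dataclass

# Hypothetical label vocabulary; real datasets define their own.
BEHAVIORS = ["groom", "rear", "freeze", "walk"]

@dataclass
class ClipAnnotation:
    video_path: str
    behavior: str  # one label from BEHAVIORS

def to_videoqa_example(ann: ClipAnnotation) -> dict:
    """Wrap a closed-vocabulary label as a question with a code-based answer."""
    question = (
        "Watch the video and report the animal's behavior by calling "
        f"classify(label) with one of {BEHAVIORS}."
    )
    # The target output is a small program rather than a bare class index,
    # which keeps the prediction interpretable and machine-checkable.
    answer_code = f'classify("{ann.behavior}")'
    return {"video": ann.video_path, "question": question, "answer": answer_code}

if __name__ == "__main__":
    example = to_videoqa_example(ClipAnnotation("clips/mouse_0001.mp4", "groom"))
    print(example["question"])
    print(example["answer"])
```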
Submission Number: 26