Fine-tuning Vision-Language Models for Animal Behavior Analysis

Published: 31 Jul 2025, Last Modified: 31 Jul 2025 · LM4Sci · CC BY 4.0
Keywords: vision language models, video understanding, ethology, neuroscience
TL;DR: We develop a new framework that transforms existing datasets into multi-task code-based annotations. Fine-tuning a state-of-the-art VLM on this data outperforms vision-only models and larger zero-shot VLMs.
Abstract: Animal behavior analysis is fundamental to ethology, behavioral ecology, and neuroscience. Current methods mostly rely on vision-only classifiers, which are task-specific and limited to closed-vocabulary classification paradigms. Vision-language models (VLMs) show strong video question-answering (VideoQA) performance across domains but remain underexplored for animal behavior. We present a novel framework that converts existing datasets into a comprehensive multi-task VideoQA dataset with code-based solutions, without requiring extra annotation. Fine-tuning InternVL3-8B on this dataset, we achieve improvements of up to 33.2 and 26.9 percentage points over supervised vision-only baselines and over zero-shot VLMs with 10× more parameters, respectively. Our systematic evaluation demonstrates the superiority of vision-language approaches and advances interpretable, code-based predictions to enhance scientific insight.
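To illustrate the dataset-conversion idea described above, the sketch below shows one plausible way an existing closed-vocabulary behavior label could be turned into a VideoQA item whose target answer is executable code. This is a minimal illustration, not the authors' implementation: the names `BehaviorAnnotation`, `to_videoqa_item`, and the field layout are assumptions for demonstration only.

```python
# Hypothetical sketch (not the paper's actual pipeline): convert a
# vision-only classification label into a VideoQA example whose answer
# is a small, verifiable code snippet rather than a bare class name.

from dataclasses import dataclass


@dataclass
class BehaviorAnnotation:
    video_path: str   # path to the source clip
    start_frame: int  # first frame of the labeled segment
    end_frame: int    # last frame of the labeled segment
    label: str        # closed-vocabulary behavior class, e.g. "grooming"


def to_videoqa_item(ann: BehaviorAnnotation) -> dict:
    """Build a multi-task VideoQA training example with a code-based answer."""
    question = (
        "Which behavior does the animal perform between frames "
        f"{ann.start_frame} and {ann.end_frame}? Answer with code."
    )
    # The target string the VLM would be fine-tuned to generate.
    answer_code = (
        "def predict_behavior():\n"
        f"    return {ann.label!r}\n"
    )
    return {"video": ann.video_path, "question": question, "answer": answer_code}


if __name__ == "__main__":
    item = to_videoqa_item(
        BehaviorAnnotation("clips/mouse_001.mp4", 120, 240, "grooming")
    )
    print(item["answer"])
```

Expressing targets as code in this way is one route to the interpretable, code-based predictions the abstract mentions, since the generated answer can be executed and checked rather than only string-matched.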
Submission Number: 26