Keywords: vision language models, video understanding, ethology, neuroscience
TL;DR: We develop a new framework that transforms existing datasets into multi-task code-based annotations. Fine-tuning a state-of-the-art VLM on this data outperforms vision-only models and larger zero-shot VLMs.
Abstract: Animal behavior analysis is fundamental to ethology, behavioral ecology, and neuroscience. Current methods typically use vision-only classifiers, which are task-specific and limited to closed-vocabulary classification paradigms. Vision-language models (VLMs) show strong video question-answering (VideoQA) performance across domains but remain underexplored for animal behavior. We present a novel framework that converts existing datasets into a comprehensive multi-task VideoQA dataset with code-based solutions, requiring no additional annotation. Fine-tuning InternVL3-8B on this dataset, we achieve up to 33.2 and 26.9 percentage-point improvements over supervised vision-only baselines and zero-shot VLMs with 10× more parameters, respectively. Our systematic evaluation demonstrates the superiority of vision-language approaches and advances interpretable, code-based predictions for behavioral analysis.
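To illustrate what a "code-based" VideoQA annotation could look like, the sketch below shows one hypothetical way an existing closed-vocabulary label (video clip + behavior class) might be rewrapped as a question whose target answer is a short program. The label set, field names, and `classify` call are assumptions for illustration only, not the authors' released format.

```python
# Minimal sketch (assumed format, not the authors' implementation):
# converting a classification-style annotation into a VideoQA item
# whose answer is expressed as executable code.
from dataclasses import dataclass

# Hypothetical label vocabulary; real datasets define their own.
BEHAVIORS = ["groom", "rear", "freeze", "walk"]

@dataclass
class ClipAnnotation:
    video_path: str
    behavior: str  # one label from BEHAVIORS

def to_videoqa_example(ann: ClipAnnotation) -> dict:
    """Wrap a closed-vocabulary label as a question with a code-based answer."""
    question = (
        "Watch the video and report the animal's behavior by calling "
        f"classify(label) with one of {BEHAVIORS}."
    )
    # The target output is a small program rather than a bare class index,
    # which keeps the prediction interpretable and machine-checkable.
    answer_code = f'classify("{ann.behavior}")'
    return {"video": ann.video_path, "question": question, "answer": answer_code}

if __name__ == "__main__":
    example = to_videoqa_example(ClipAnnotation("clips/mouse_0001.mp4", "groom"))
    print(example["question"])
    print(example["answer"])
```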
Submission Number: 26