Understanding Task Transfer in Vision-Language Models

Published: 23 Sept 2025, Last Modified: 17 Nov 2025
Venue: UniReps 2025
License: CC BY 4.0
Supplementary Material: pdf
Track: Extended Abstract Track
Keywords: VLM, perception, finetuning, task transfer
TL;DR: We measure how finetuning on one perception task transfers to others in VLMs, introduce a new metric (PGF), and uncover task cliques, asymmetries, and scale effects for better training design.
Abstract: Vision–Language Models (VLMs) have achieved strong performance across diverse multimodal benchmarks through multitask training. Yet these models struggle on visual perception tasks, falling well short of human-level performance. Although VLMs are typically finetuned on a multitude of tasks, it remains unclear how finetuning on one perception task influences zero-shot performance on others, a question that is crucial for designing efficient training strategies. In this work, we study how finetuning on one perception task affects performance on other perception tasks and present the first systematic study of task transferability in VLMs within the perception domain. We introduce the Performance Gap Factor (PGF), a novel metric that quantifies transfer by jointly capturing its breadth (how many tasks are affected) and its magnitude (how strongly they are affected). Using three open-weight VLMs across 13 perception tasks, we construct a task graph that uncovers inter-task relationships previously unexplored in the multimodal setting. Our analysis reveals distinct cliques of mutually beneficial as well as mutually detrimental tasks, and we categorise tasks into personas based on their transfer properties. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for curating finetuning strategies and advancing general-purpose VLMs.
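For illustration only, below is a minimal sketch of how a PGF-style transfer score could be computed from evaluation results. The abstract does not give the exact PGF formula, so the aggregation used here (breadth as the fraction of other tasks whose zero-shot score changes, times the mean relative score change on those tasks), as well as all function names, task names, and scores, are assumptions rather than the paper's definition.

# Illustrative sketch (not the paper's exact PGF): given zero-shot scores of a
# base model and of a model finetuned on one source task, combine the breadth
# and magnitude of transfer into a single number for that source task.

from typing import Dict

def transfer_factor(base: Dict[str, float],
                    finetuned: Dict[str, float],
                    source_task: str,
                    eps: float = 1e-8) -> float:
    """Assumed PGF-like score: breadth (fraction of other tasks whose zero-shot
    score changes after finetuning) times magnitude (mean relative change)."""
    targets = [t for t in base if t != source_task]
    deltas = [(finetuned[t] - base[t]) / (base[t] + eps) for t in targets]
    affected = [d for d in deltas if abs(d) > 0.0]
    breadth = len(affected) / max(len(targets), 1)
    magnitude = sum(affected) / len(affected) if affected else 0.0
    return breadth * magnitude

# Hypothetical example: finetune on "depth_estimation", evaluate zero-shot elsewhere.
base_scores = {"depth_estimation": 0.42, "counting": 0.55, "spatial_relations": 0.48}
finetuned_scores = {"depth_estimation": 0.61, "counting": 0.58, "spatial_relations": 0.44}
print(transfer_factor(base_scores, finetuned_scores, "depth_estimation"))

A positive value would indicate broad, beneficial transfer from the source task, while a negative value would flag the kind of negative interference the abstract describes; aggregating such scores over all source-target pairs yields a task graph like the one the paper constructs.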
Submission Number: 139