Task Interference in VLMs for Autonomous Driving: When Better Perception Hurts Planning

Published: 06 Nov 2025, Last Modified: 06 Nov 2025 · AIR-FM Poster · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: Vision Language Models
TL;DR: Fine-tuning VLMs to improve scene understanding significantly degrades their planning performance, revealing a fundamental challenge for deploying VLMs.
Abstract: Vision-language foundation models (VLMs) show strong potential in autonomous driving for scene understanding and decision-making, yet their cross-task performance remains inconsistent. This work presents a systematic study of task interference in VLMs for autonomous driving, revealing a key trade-off: fine-tuning for better perception often degrades planning accuracy. We introduce an evaluation framework that decouples perception and planning to measure interference precisely. Using a multi-source question-answering dataset drawn from diverse driving datasets, we fine-tune state-of-the-art VLMs on action descriptions. While fine-tuned models improve decision explanation quality, they exhibit measurable declines in planning performance compared to their zero-shot counterparts. Experiments across multiple architectures confirm this perception–planning trade-off as a general phenomenon driven by attention conflicts and representation divergence. Our findings provide the first empirical validation of foundation model interference in autonomous driving and highlight critical implications for reliable deployment in safety-critical environments.
Submission Track: Workshop Paper Track
Submission Number: 4