Uncovering and Correcting Perception Model Weaknesses Using VLM-Based Analysis

ICLR 2026 Conference Submission 20051 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Explainability, VLM, NuScenes, Automated Driving, Perception
Abstract: This paper tackles the challenge of improving automated driving perception systems, focusing on rare, complex, or novel scenarios that previously deployed models fail to handle and that developers struggle to identify. To address this, we propose a novel two-stage method (SPIDER) to diagnose and resolve perception model insufficiencies using Vision Language Models (VLMs). In the first stage, we segment data in a semantic embedding space to identify regions containing visually similar samples that differ in detection performance. By comparing these high- and low-performance subsets, we use a VLM to extract semantic effects: interpretable factors correlated with model errors. In the second stage, these effects guide targeted data acquisition to improve the model. Samples representing the identified effects are selected, and the perception model is fine-tuned on this curated dataset. Evaluations on the NuScenes dataset demonstrate that SPIDER can effectively identify insufficiencies in the perception model and quantify key parameters. SPIDER enhances model robustness and improves transparency and explainability, which are critical for safety in automated driving systems.
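As a rough illustration of the first stage, the sketch below clusters precomputed image embeddings, contrasts the best- and worst-detected samples within each cluster, and asks a VLM to name visual factors that distinguish them. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify KMeans as the segmentation method, the per-sample detection scores, the 0.2 performance-gap threshold, or the VLM interface; `query_vlm` is a hypothetical placeholder for whatever multimodal API is actually used.

```python
# Minimal sketch of SPIDER's stage 1 (assumptions noted in the lead-in).
# Inputs: precomputed image embeddings (e.g., from a CLIP-style encoder)
# and per-sample detection scores (e.g., per-frame mAP).
import numpy as np
from sklearn.cluster import KMeans


def query_vlm(prompt: str, image_paths: list[str]) -> str:
    """Hypothetical VLM call; replace with an actual multimodal API."""
    raise NotImplementedError


def extract_semantic_effects(embeddings: np.ndarray,
                             scores: np.ndarray,
                             image_paths: list[str],
                             n_regions: int = 20,
                             k_examples: int = 4) -> list[str]:
    """Segment the embedding space, then contrast high- vs. low-performance
    samples within each region to elicit interpretable failure factors."""
    labels = KMeans(n_clusters=n_regions, n_init="auto").fit_predict(embeddings)
    effects = []
    for region in range(n_regions):
        idx = np.flatnonzero(labels == region)
        if len(idx) < 2 * k_examples:
            continue  # too few visually similar samples to contrast
        # Sort the region's samples by detection performance.
        order = idx[np.argsort(scores[idx])]
        low, high = order[:k_examples], order[-k_examples:]
        # Only analyze regions where performance actually diverges
        # (0.2 is an illustrative threshold, not from the paper).
        if scores[high].mean() - scores[low].mean() < 0.2:
            continue
        prompt = ("The first images are detected well, the last ones poorly. "
                  "Name visual factors that plausibly explain the difference.")
        paths = [image_paths[i] for i in np.concatenate([high, low])]
        effects.append(query_vlm(prompt, paths))
    return effects
```

In this reading, the returned effect descriptions would then drive the second stage's targeted data acquisition, for example by ranking unlabeled samples against each description with a text-image embedding model before fine-tuning the perception model on the selected data.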
Primary Area: interpretability and explainable AI
Submission Number: 20051