Towards Trustworthy Autonomous Vehicles with Vision-Language Models Under Targeted and Untargeted Adversarial Attacks

Published: 01 Jan 2025 · Last Modified: 12 Nov 2025 · CVPR Workshops 2025 · CC BY-SA 4.0
Abstract: The integration of autonomous vehicles (AVs) into the transportation sector promises a transformative impact on mobility, safety, and efficiency. Yet deploying these advanced systems in dynamic, complex, and unpredictable real-world environments presents substantial challenges, particularly in safeguarding their operational integrity against adversarial attacks. This paper rigorously examines the robustness of Vision-Language Models (VLMs) within AVs, emphasizing their resilience to both targeted and untargeted adversarial threats. We comprehensively evaluate four vision encoders: CLIP, TeCoA, FARE, and Sim-CLIP. These models are assessed for their ability to maintain accurate and reliable performance under adversarial manipulation, using a carefully preprocessed dataset designed to elicit semantically detailed scene descriptions for caption generation. The paper explores the performance of these models across various adversarial scenarios, establishing benchmarks for their capability to interpret complex multimodal inputs under subtle adversarial manipulations. The findings reveal notable variance in resilience across models and AV datasets, with Sim-CLIP outperforming the others in robustness, maintaining high accuracy under adversarial conditions.
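To make the untargeted threat model concrete: a standard formulation (e.g. PGD-style attacks commonly used to evaluate encoders such as CLIP) perturbs the input within a small L-infinity ball so that the perturbed embedding drifts away from the clean one. The sketch below is a minimal illustration against a hypothetical linear encoder `f(x) = W @ x`, standing in for a real vision encoder; it is not the paper's actual attack setup, and all names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def pgd_untargeted(x, W, eps=0.03, alpha=0.01, steps=10, rng=None):
    """Untargeted L-inf PGD sketch against a toy linear encoder f(x) = W @ x.

    Maximizes ||f(x_adv) - f(x)||^2 so the perturbed embedding drifts away
    from the clean embedding, while keeping x_adv within an eps-ball of x
    and inside the valid pixel range [0, 1].
    """
    rng = rng or np.random.default_rng(0)
    x0 = x.copy()
    z0 = W @ x0                                     # clean embedding
    # Random start inside the eps-ball (otherwise the first gradient is zero).
    x_adv = np.clip(x0 + rng.uniform(-eps, eps, x0.shape), 0.0, 1.0)
    for _ in range(steps):
        diff = W @ x_adv - z0                       # embedding drift
        grad = 2.0 * W.T @ diff                     # d/dx ||W x - z0||^2
        x_adv = x_adv + alpha * np.sign(grad)       # signed gradient ascent
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)  # project onto eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep valid pixel range
    return x_adv
```

A targeted variant would instead minimize the distance to a chosen target embedding; robustness evaluations like those in the abstract compare how far such perturbations can push each encoder's output.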