Keywords: Safe Autonomous Driving, Vision-language Model, Adversarial Scene Generation
Abstract: Safety-critical scenario generation is essential for evaluating autonomous vehicles, yet existing approaches often require extensive manual design and lack scalability. This work proposes an automated framework that combines vision-language models (VLMs) and large language models (LLMs) to generate realistic safety-critical driving scenarios from naturalistic driving videos. We introduce a three-mode generation framework that transforms both accident and normal traffic videos into adversarial scenarios while preserving dataset-specific distributions and cultural driving patterns. Our pipeline first employs a VLM to convert input videos into structured scene descriptions capturing road geometry, traffic participants, and their interactions with the ego vehicle. These descriptions are then translated into executable Scenic programs, supported by an LLM-based error-correction module that ensures the generated code runs and simulates stably in CARLA. We evaluate the framework using SafeBench across eight challenging base scenario categories and test the generated scenarios against three reinforcement-learning driving agents. The results demonstrate that our method produces diverse and realistic adversarial situations, improving scenario variety, realism, and failure coverage over baseline approaches. Overall, this work shows that integrating VLMs and LLMs enables scalable generation of safety-critical scenarios, offering a promising tool for more robust and comprehensive autonomous-driving evaluation.
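The pipeline the abstract describes (video to VLM scene description, then to an LLM-generated Scenic program, with error-corrected simulation in CARLA) could be orchestrated roughly as in the sketch below. All function names and interfaces here are illustrative assumptions, not the paper's actual implementation; the VLM, LLM, and simulator are passed in as opaque callables.

```python
# Minimal sketch of the described generation pipeline.
# describe_scene, synthesize_scenic, and generate_scenario are hypothetical
# helpers; `vlm`, `llm`, and `simulate` stand in for the real model and
# CARLA-backed simulation interfaces, which the paper does not specify.

def describe_scene(video_path, vlm):
    """VLM step: turn a driving video into a structured scene description
    (road geometry, traffic participants, ego interactions)."""
    return vlm(video_path)

def synthesize_scenic(description, llm):
    """LLM step: translate the scene description into a Scenic program."""
    return llm(f"Write a Scenic program for this scene: {description}")

def generate_scenario(video_path, vlm, llm, simulate, max_repairs=3):
    """End-to-end: video -> description -> Scenic program, with an
    LLM-based repair loop when compilation or simulation fails."""
    description = describe_scene(video_path, vlm)
    program = synthesize_scenic(description, llm)
    for _ in range(max_repairs):
        ok, error = simulate(program)  # e.g. try to run the program in CARLA
        if ok:
            return program
        # Error-correction module: feed the failure back to the LLM.
        program = llm(
            f"Fix this Scenic program.\nError: {error}\nCode:\n{program}"
        )
    raise RuntimeError("scenario could not be stabilized within repair budget")
```

The repair loop is the key design point: rather than assuming the LLM emits valid Scenic on the first try, each simulation failure is returned to the model until the scenario runs or a retry budget is exhausted.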
Submission Number: 11