Keywords: autonomous driving, reasoning model, VLM, remote driving, teleoperation, alignment
TL;DR: We distill a driving-specialized 8B VLM into a lightweight 2B student via Safety-Aware DPO to proactively flag teleoperation-critical scenes, achieving 92% recall on intervention-required events while running at 8–10 FPS per feed.
Abstract: Recent advances in autonomous driving enable vehicles to operate in increasingly complex environments, yet safe deployment still requires timely human intervention when the autonomous system encounters ambiguous or safety-critical situations. Existing remote driving systems rely on heuristic or confidence-based triggers to request assistance; these lack semantic understanding of driving context and often cause over- or under-escalation. We propose a vision–language reasoning-driven escalation framework that enables autonomous vehicles to request remote driving assistance based on context-aware semantic reasoning rather than numerical uncertainty alone. Our approach aligns a vision–language model (VLM) with Direct Preference Optimization (DPO) on pairwise operator preferences to calibrate escalation decisions and reduce unnecessary or missed escalations. The model interprets visual observations and driving context to generate structured explanations that help operators quickly understand and respond to situations. We evaluate the approach using NVIDIA Cosmos-2 across diverse uncertainty scenarios, including intersection negotiation and dynamic obstacle ambiguity. Results show improved intervention efficiency, reduced unnecessary operator engagement, and enhanced safety compared to heuristic triggering, highlighting a scalable pathway toward reliable human-in-the-loop autonomous driving.
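The DPO alignment step described in the abstract can be illustrated with a minimal sketch of the standard DPO objective on a single operator-preference pair. All function and variable names below are illustrative assumptions, not the authors' implementation; the actual training details (e.g. the Safety-Aware DPO variant mentioned in the TL;DR) are not specified in this abstract.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative names).

    logp_w / logp_l: policy log-likelihoods of the operator-preferred and
    operator-rejected escalation responses for the same scene.
    ref_logp_w / ref_logp_l: the same log-likelihoods under the frozen
    reference model. beta scales the implicit KL penalty.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): the loss shrinks as the policy assigns
    # relatively more probability to the preferred response than the
    # reference model does.
    return math.log(1.0 + math.exp(-margin))
```

Averaged over a dataset of operator preference pairs, minimizing this loss pushes the VLM toward escalation decisions operators judged correct, without a separately trained reward model.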
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 25